Data Ingestion

  • BP For a multi-threaded application, use either:
    • a separate connection for each thread, or
    • execute the threads synchronously
  • Uploading using Web UI is limited to 50 MB
  • can read from S3 Glacier Instant Retrieval
  • copy into v/s insert: copy into provides file management and restricts allowed transformations
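As a minimal sketch of the copy into v/s insert point (all table, stage, and file names here are hypothetical):

```sql
-- COPY INTO provides file management: load metadata is tracked per
-- file, so re-running the statement skips already-loaded files.
COPY INTO my_db.my_schema.events
  FROM @my_stage/events/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- INSERT has no file management and no load metadata, but allows
-- arbitrary SQL transformations in the SELECT.
INSERT INTO my_db.my_schema.events
  SELECT id, event_ts, payload FROM my_db.my_schema.events_raw;
```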

Bulk Load

  • BP recommended file size 100-250 MB (compressed); for Parquet set use_vectorized_scanner = true to load data in parallel (default false for now)
  • supported file formats: CSV, JSON, XML, Avro, ORC, Parquet
  • specifying files to be loaded using:
    1. FILES: explicitly names the files to be loaded; fastest; can specify only up to 1000 files
    2. PATTERN: usually the slowest, and works differently in bulk v/s streaming
      • BP: prefer using cloud provider's event filtering over pattern matching for better performance
    3. prefix/path
  • required AWS permissions: s3:GetBucketLocation, s3:GetObject, s3:GetObjectVersion, s3:ListBucket
  • use VALIDATE('sch.tbl', job_id => <qid> | _last) table function for post validation
    • _last refers to last copy into statement regardless of sch.tbl specified
  • data validation (pre or post) does not support copy into that has transformations
  • copy into options priority order: statement > stage definition > table definition
    • copy into options set at different levels are cumulative, whereas sub-options (e.g. file_format) are not cumulative
  • uncompressed files will be gzip compressed, unless explicitly disabled, when staging on Snowflake stage
  • Streams undocumented offset: SELECT ... FROM ... AT (STREAM => 'MY_STREAM') ...
  • available metadata columns: METADATA$FILENAME, METADATA$FILE_ROW_NUMBER
  • Load metadata is stored for each table and includes
    • name, size, eTag, rows parsed, last load timestamp, errors
  • Loading semi-structured data into a variant, Snowflake tries to extract individual elements into columnar storage. Extraction is not performed for:
    • attributes with null values; use the STRIP_NULL_VALUES file format option to remove them
    • attributes with multiple different types of values
    • more than 200 attributes
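A hedged sketch tying a few of the bulk-load points together (object and file names are hypothetical):

```sql
-- Explicit FILES list (fastest option; max 1000 files), loading the
-- metadata columns alongside the data via a transformation SELECT
COPY INTO my_tbl (fname, rownum, c1)
  FROM (SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, t.$1
        FROM @my_stage t)
  FILES = ('2024/part-001.csv', '2024/part-002.csv');

-- Post-load validation of the last COPY INTO statement
SELECT * FROM TABLE(VALIDATE(my_tbl, JOB_ID => '_last'));
```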

Options

  • force: reload previously loaded/uncertain files
  • load_uncertain_files: load metadata expires after 64 days, so it is uncertain whether older files have already been loaded or not
  • on_error:
    • continue: only discard rows which have errors
    • skip_file, skip_file_<n>, skip_file_<n>%: skip files that have any errors, more than n errors, or more than n% of rows in error
    • abort_statement: abort the copy statement
  • validation_mode (load mode): validates the files without actually loading the data. Not applicable when also specifying a transformation
    • return_<N>_rows: validates <N> rows
    • return_errors: returns all errors in all files.
    • return_all_errors: returns all errors in all files and previous partially loaded files due to on_error=continue
  • purge: remove files after loading. If removal fails, the command still ends with success
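The options above can be combined as follows (a sketch with hypothetical names): first a dry run that only reports errors, then the real load.

```sql
-- Dry run: report all errors in all files without loading anything
COPY INTO my_tbl FROM @my_stage
  VALIDATION_MODE = RETURN_ERRORS;

-- Real load: skip any file where more than 5% of rows error out,
-- and delete files from the stage after a successful load
COPY INTO my_tbl FROM @my_stage
  ON_ERROR = 'SKIP_FILE_5%'
  PURGE = TRUE;
```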

Transformations

  • supports only named and user stages (not table stages), and CSV, JSON, Avro, ORC, Parquet and XML data
  • CSV files support column reordering, omission, casts, truncation
  • unsupported:
    • FLATTEN isn't supported with copy into, but can be used with insert into
    • GROUP BY and JOIN
    • filtering using WHERE or LIMIT
    • the validation_mode parameter
  • can use SQL UDF
  • CSV transformations: column reordering, column exclusions, using sequence/identity
  • Semi-structured transformations: split() into an array
  • ??? Errors during transformation result in default error handling regardless of the ON_ERROR option (i.e. ABORT_STATEMENT for COPY or SKIP_FILE for Snowpipe)
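A sketch of a transformation load (hypothetical names): column reordering, omission, and casts are allowed, while FLATTEN, JOIN, GROUP BY, and WHERE would be rejected here.

```sql
COPY INTO my_tbl (id, created_at, payload)
  FROM (SELECT t.$2::NUMBER,        -- reorder: file column 2 -> id
               t.$1::TIMESTAMP_NTZ, -- cast:    file column 1 -> created_at
               t.$4                 -- omission: file column 3 skipped
        FROM @my_stage t)
  FILE_FORMAT = (TYPE = 'CSV');
```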

Bulk Unload

  • VALIDATION_MODE = RETURN_ROWS (unload mode): Validates the query but doesn't unload to the specified file
  • file names when using partitioning: data_ + PARTITION BY values + UUID
  • file names when no partitioning: if the path doesn't specify a prefix, data_ is used as the prefix
  • Unloading creates multiple files of max 16 MB (compressed) each by default. Specify the SINGLE option to create a single file
  • number of files generated depends on WH size, but is not always 8 per core. E.g. Both L and XL can generate same number (384) of files
  • supported file format: CSV, JSON and Parquet
  • requires s3:DeleteObject and s3:PutObject permissions; s3:ListObject is not needed
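A partitioned unload might look like this (stage, table, and column names are hypothetical); file names take the form data_ + partition value + UUID:

```sql
COPY INTO @my_stage/out/
  FROM (SELECT dt, c1, c2 FROM my_tbl)
  PARTITION BY ('dt=' || dt)            -- string expression per row
  FILE_FORMAT = (TYPE = 'PARQUET')
  MAX_FILE_SIZE = 16777216;             -- the ~16 MB default, made explicit
```

Dropping PARTITION BY and adding SINGLE = TRUE would instead produce one output file.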

S3-compat

  • create external stage, specify
    • URL = s3compat://<bucket>...
    • ENDPOINT = <endpoint>
  • no storage integration is needed
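Putting the S3-compat points together (bucket, endpoint, and credentials are placeholders): credentials go directly on the stage, since no storage integration is available for S3-compatible storage.

```sql
CREATE STAGE s3compat_stage
  URL = 's3compat://mybucket/files/'
  ENDPOINT = 'storage.example.com'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');
```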

Spark connector

  • offers high speed data transfer and query pushdown from Spark to Snowflake
    • using only the JDBC driver doesn't support query pushdown, and is slow
    • Spark connector analyzes Spark logical plans and converts most of the operators into SQL
  • can use either internal (Snowflake managed) or (now deprecated) external storage (user managed)
  • converts each DF partition to one file in stage. Performance is directly related to number of partitions
  • Converts a DataFrame to CSV files if DF is simple, or to JSON if it has any complex data types (map/array)
  • Small result-sets are read through driver node, whereas larger result-sets are read through executor nodes
  • Arrow is used for reads (v2.6+), stages are used for writes