Data Ingestion
BP For a multi-threaded application, use either:
a separate connection for each thread, or
synchronous execution of the threads
Uploading using Web UI is limited to 50 MB
can read from S3 Glacier Instant Retrieval
copy into v/s insert: copy into provides file management and restricts allowed transformations
Bulk Load
BP recommended file size 100-250 MB
for Parquet, set USE_VECTORIZED_SCANNER = TRUE to load data in parallel (default FALSE for now)
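A minimal sketch of the option above (table and stage names are placeholders):

```sql
-- Load Parquet with the vectorized scanner enabled (parallel loading);
-- my_table and @my_stage are placeholder names
COPY INTO my_table
  FROM @my_stage/parquet/
  FILE_FORMAT = (TYPE = PARQUET USE_VECTORIZED_SCANNER = TRUE);
```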
supported file formats: CSV, JSON, XML, Avro, ORC, Parquet
specifying files to be loaded using:
FILES: explicitly names the files to be loaded; fastest; can specify only up to 1000 files
PATTERN: usually the slowest, and works differently in bulk v/s streaming
BP: prefer using cloud provider's event filtering over pattern matching for better performance
prefix/path
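Hedged sketches of the three scoping options (all names are placeholders); FILES and PATTERN are shown as separate statements:

```sql
-- Scope by path prefix
COPY INTO my_table FROM @my_stage/2024/01/;

-- Explicit file list (fastest; up to 1000 files)
COPY INTO my_table FROM @my_stage
  FILES = ('2024/01/day01.csv', '2024/01/day02.csv');

-- Regex pattern (usually the slowest)
COPY INTO my_table FROM @my_stage
  PATTERN = '.*2024/01/day[0-9]+[.]csv';
```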
required AWS permissions: s3:GetBucketLocation, s3:GetObject, s3:GetObjectVersion, s3:ListBucket
use the VALIDATE(sch.tbl, JOB_ID => '<qid>' | '_last') table function for post-load validation
'_last' refers to the last COPY INTO statement executed in the session, regardless of the sch.tbl specified
data validation (pre or post) does not support COPY INTO statements that include transformations
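For example (my_table is a placeholder; the query ID stays elided):

```sql
-- Errors from the most recent COPY INTO in this session
SELECT * FROM TABLE(VALIDATE(my_table, JOB_ID => '_last'));

-- Errors from a specific COPY INTO, identified by its query ID
SELECT * FROM TABLE(VALIDATE(my_table, JOB_ID => '<qid>'));
```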
copy into options priority order : statement > stage definition > table definition
copy into options set at different levels are cumulative, whereas sub-options (e.g. file_format) are not cumulative
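A sketch of that behavior, assuming the non-cumulative sub-option rule above (placeholder names):

```sql
-- Stage-level file format
CREATE OR REPLACE STAGE my_stage
  FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1);

-- A statement-level FILE_FORMAT replaces the stage's sub-options wholesale
-- (they are not merged), so SKIP_HEADER falls back to its default here
COPY INTO my_table FROM @my_stage
  FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',');
```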
uncompressed files will be gzip compressed, unless explicitly disabled, when staging on Snowflake stage
Streams undocumented offset: SELECT ... FROM ... AT (STREAM => 'MY_STREAM') ...
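Since the note flags this offset syntax as undocumented, treat the following as a sketch (names are placeholders):

```sql
-- Read my_table as of the offset currently held by MY_STREAM
SELECT * FROM my_table AT (STREAM => 'MY_STREAM');
```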
available metadata columns: METADATA$FILENAME, METADATA$FILE_ROW_NUMBER
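These columns can be selected alongside the data during a load, e.g. (placeholder names):

```sql
-- Capture the source file name and row number with each loaded row
COPY INTO my_table (src_file, src_row, col1)
  FROM (SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, t.$1
        FROM @my_stage t);
```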
Load metadata is stored for each table and includes
name, size, eTag, rows parsed, last load timestamp, errors
When loading semi-structured data into a VARIANT column, Snowflake tries to extract individual elements into columnar storage. Extraction is skipped for:
attributes with null values; use the JSON file format option STRIP_NULL_VALUES to remove them
attributes with values of multiple different types
attributes beyond 200 in number
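Per the note on null-valued attributes, a hedged example (placeholder names):

```sql
-- Remove null-valued attributes at load time so they don't block
-- columnar extraction
COPY INTO my_json_table
  FROM @my_stage/events/
  FILE_FORMAT = (TYPE = JSON STRIP_NULL_VALUES = TRUE);
```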
force: reload previously loaded/uncertain files
load_uncertain_files: load metadata expires after 64 days; files staged before that aren't in the metadata, so whether they have been loaded is uncertain
on_error:
continue: only discard rows which have errors
skip_file, skip_file_<n>, skip_file_<n>%: skip files that have any error, at least n errors, or at least n% of rows in error
abort_statement: abort the copy statement
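The options above, sketched (placeholder names; ABORT_STATEMENT is the default for COPY):

```sql
COPY INTO my_table FROM @my_stage ON_ERROR = CONTINUE;         -- discard only bad rows
COPY INTO my_table FROM @my_stage ON_ERROR = SKIP_FILE;        -- skip any file with an error
COPY INTO my_table FROM @my_stage ON_ERROR = SKIP_FILE_5;      -- skip files with 5+ bad rows
COPY INTO my_table FROM @my_stage ON_ERROR = 'SKIP_FILE_5%';   -- skip files with 5%+ bad rows
COPY INTO my_table FROM @my_stage ON_ERROR = ABORT_STATEMENT;  -- abort the whole statement
```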
validation_mode (load mode): doesn't actually load the data. Not applicable when the statement also specifies a transformation
return_<N>_rows: validates <N> rows
return_errors: returns all errors in all files.
return_all_errors: returns all errors in all files and previous partially loaded files due to on_error=continue
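A dry-run sketch of these modes (placeholder names):

```sql
-- Validate the first 10 rows without loading anything
COPY INTO my_table FROM @my_stage VALIDATION_MODE = RETURN_10_ROWS;

-- Report every error across all files, still without loading
COPY INTO my_table FROM @my_stage VALIDATION_MODE = RETURN_ERRORS;
```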
purge: remove files after loading. If the removal fails, the command still completes successfully
transformations support only named and user stages (not table stages), and CSV, JSON, Avro, ORC, Parquet and XML data
CSV files support column reordering, omission, casts, truncation
unsupported:
flatten isn't supported with copy into, but can be used with insert into
GROUP BY and JOIN
filtering using WHERE or LIMIT
validation_mode parameter
can use SQL UDF
CSV transformations: column reordering, column exclusions, using sequence/identity
Semi-structured transformations: split() into an array
??? Errors during transformation result in default error handling regardless of the ON_ERROR option (i.e. ABORT_STATEMENT for COPY or SKIP_FILE for Snowpipe)
Bulk Unload
VALIDATION_MODE = RETURN_ROWS (unload mode): Validates the query but doesn't unload to the specified file
file names when using partitioning: data_ + PARTITION BY values + UUID
file names when no partitioning is used and the path doesn't specify a prefix: data_ is used as the prefix
Unloading creates multiple files of up to 16 MB (compressed) each by default. Specify the SINGLE option to create a single file
number of files generated depends on WH size, but is not always 8 per core. E.g. Both L and XL can generate same number (384) of files
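A hedged unload sketch combining the notes above (placeholder names; 16777216 bytes is the 16 MB default MAX_FILE_SIZE):

```sql
-- Partitioned unload: files land as data_<partition value>_<uuid>...
COPY INTO @my_stage/out/
  FROM my_table
  PARTITION BY ('dt=' || TO_VARCHAR(load_date))
  FILE_FORMAT = (TYPE = PARQUET)
  MAX_FILE_SIZE = 16777216;

-- Or force a single output file
COPY INTO @my_stage/out/result.csv
  FROM my_table
  SINGLE = TRUE;
```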
supported file format: CSV, JSON and Parquet
requires s3:DeleteObject and s3:PutObject permissions; s3:ListObject is not needed
S3-compat
create external stage, specify
URL = s3compat://<bucket>...
ENDPOINT = <endpoint>
no storage integration is needed
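A sketch of such a stage (bucket, endpoint, and credentials are placeholders); per the note, no storage integration is created:

```sql
CREATE OR REPLACE STAGE my_s3compat_stage
  URL = 's3compat://my-bucket/path/'
  ENDPOINT = 'storage.example.com'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');
```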
Spark connector
offers high speed data transfer and query pushdown from Spark to Snowflake
using only the JDBC driver doesn't support query pushdown, and is slow
Spark connector analyzes Spark logical plans and converts most of the operators into SQL
can use either internal (Snowflake managed) or (now deprecated) external storage (user managed)
converts each DF partition to one file in stage. Performance is directly related to number of partitions
Converts a DataFrame to CSV files if DF is simple, or to JSON if it has any complex data types (map/array)
Small result-sets are read through driver node, whereas larger result-sets are read through executor nodes
Arrow is used for reads (v2.6+), stages are used for writes