Data Ingestion

  • BP For a multi-threaded application, use either:
    • a separate connection for each thread, or
    • execute the threads synchronously
  • Uploading using Web UI is limited to 50 MB
  • can read from S3 Glacier Instant Retrieval
  • copy into v/s insert: copy into provides file management and restricts allowed transformations
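As a minimal sketch of the copy into v/s insert point (all table, stage, and file names here are hypothetical):

```sql
-- COPY INTO provides file management: load metadata is tracked per
-- file, so re-running the statement skips already-loaded files.
COPY INTO my_db.my_schema.events
  FROM @my_stage/events/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- INSERT has no file management and no load metadata, but allows
-- arbitrary SQL transformations in the SELECT.
INSERT INTO my_db.my_schema.events
  SELECT id, event_ts, payload FROM my_db.my_schema.events_raw;
```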

Bulk Load

  • BP recommended file size 100-250 MB (compressed); for Parquet set use_vectorized_scanner = true to load data in parallel (default false for now)
  • supported file formats: CSV, JSON, XML, Avro, ORC, Parquet
  • specifying files to be loaded using:
    1. FILES: explicitly names the files to be loaded; fastest; can specify only up to 1000 files
    2. PATTERN: usually the slowest, and works differently in bulk v/s streaming
      • BP: prefer using cloud provider's event filtering over pattern matching for better performance
    3. prefix/path
  • required AWS permissions: s3:GetBucketLocation, s3:GetObject, s3:GetObjectVersion, s3:ListBucket
  • use VALIDATE('sch.tbl', job_id => <qid> | _last) table function for post validation
    • _last refers to last copy into statement regardless of sch.tbl specified
  • data validation (pre or post) does not support copy into that has transformations
  • copy into options priority order: statement > stage definition > table definition
    • copy into options set at different levels are cumulative, whereas sub-options (e.g. file_format) are not cumulative
  • uncompressed files will be gzip compressed, unless explicitly disabled, when staging on Snowflake stage
  • Streams undocumented offset: SELECT ... FROM ... AT (STREAM => 'MY_STREAM') ...
  • available metadata columns: METADATA$FILENAME, METADATA$FILE_ROW_NUMBER
  • Load metadata is stored for each table and includes
    • name, size, eTag, rows parsed, last load timestamp, errors
  • Loading semi-structured data into a variant, Snowflake tries to extract individual elements into columnar storage. Extraction is not performed for:
    • attributes with null values; use the STRIP_NULL_VALUES file format option to remove them
    • attributes with multiple different types of values
    • more than 200 attributes
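A hedged sketch tying a few of the bulk-load points together (object and file names are hypothetical):

```sql
-- Explicit FILES list (fastest option; max 1000 files), loading the
-- metadata columns alongside the data via a transformation SELECT
COPY INTO my_tbl (fname, rownum, c1)
  FROM (SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, t.$1
        FROM @my_stage t)
  FILES = ('2024/part-001.csv', '2024/part-002.csv');

-- Post-load validation of the last COPY INTO statement
SELECT * FROM TABLE(VALIDATE(my_tbl, JOB_ID => '_last'));
```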

Options

  • force: reload previously loaded/uncertain files
  • load_uncertain_files: load metadata expires after 64 days, so it is uncertain whether older files have already been loaded or not
  • on_error:
    • continue: only discard rows which have errors
    • skip_file, skip_file_<n>, skip_file_<n>%: skip files that have any errors, more than n errors, or more than n% of rows in error
    • abort_statement: abort the copy statement
  • validation_mode (load mode): validates the files without actually loading the data. Not applicable when also specifying a transformation
    • return_<N>_rows: validates <N> rows
    • return_errors: returns all errors in all files.
    • return_all_errors: returns all errors in all files and previous partially loaded files due to on_error=continue
  • purge: remove files after loading. If removal fails, the command still ends with success
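The options above can be combined as follows (a sketch with hypothetical names): first a dry run that only reports errors, then the real load.

```sql
-- Dry run: report all errors in all files without loading anything
COPY INTO my_tbl FROM @my_stage
  VALIDATION_MODE = RETURN_ERRORS;

-- Real load: skip any file where more than 5% of rows error out,
-- and delete files from the stage after a successful load
COPY INTO my_tbl FROM @my_stage
  ON_ERROR = 'SKIP_FILE_5%'
  PURGE = TRUE;
```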

Transformations

  • supports only named and user stages (not table stages), and CSV, JSON, Avro, ORC, Parquet and XML data
  • CSV files support column reordering, omission, casts, truncation
  • unsupported:
    • FLATTEN isn't supported with copy into, but can be used with insert into
    • GROUP BY and JOIN
    • filtering using WHERE or LIMIT
    • the validation_mode parameter
  • can use SQL UDF
  • CSV transformations: column reordering, column exclusions, using sequence/identity
  • Semi-structured transformations: split() into an array
  • ??? Errors during transformation result in default error handling regardless of the ON_ERROR option (i.e. ABORT_STATEMENT for COPY or SKIP_FILE for Snowpipe)
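A sketch of a transformation load (hypothetical names): column reordering, omission, and casts are allowed, while FLATTEN, JOIN, GROUP BY, and WHERE would be rejected here.

```sql
COPY INTO my_tbl (id, created_at, payload)
  FROM (SELECT t.$2::NUMBER,        -- reorder: file column 2 -> id
               t.$1::TIMESTAMP_NTZ, -- cast:    file column 1 -> created_at
               t.$4                 -- omission: file column 3 skipped
        FROM @my_stage t)
  FILE_FORMAT = (TYPE = 'CSV');
```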

Bulk Unload

  • VALIDATION_MODE = RETURN_ROWS (unload mode): Validates the query but doesn't unload to the specified file
  • file names when using partitioning: data_ + PARTITION BY values + UUID
  • file names when no partitioning: if the path doesn't specify a prefix, data_ is used as the prefix
  • Unloading creates multiple files of max 16 MB (compressed) each by default. Specify the SINGLE option to create a single file
  • number of files generated depends on WH size, but is not always 8 per core. E.g. Both L and XL can generate same number (384) of files
  • supported file format: CSV, JSON and Parquet
  • requires s3:DeleteObject and s3:PutObject permissions; s3:ListObject is not needed
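A partitioned unload might look like this (stage, table, and column names are hypothetical); file names take the form data_ + partition value + UUID:

```sql
COPY INTO @my_stage/out/
  FROM (SELECT dt, c1, c2 FROM my_tbl)
  PARTITION BY ('dt=' || dt)            -- string expression per row
  FILE_FORMAT = (TYPE = 'PARQUET')
  MAX_FILE_SIZE = 16777216;             -- the ~16 MB default, made explicit
```

Dropping PARTITION BY and adding SINGLE = TRUE would instead produce one output file.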

S3-compat

  • create external stage, specify
    • URL = s3compat://<bucket>...
    • ENDPOINT = <endpoint>
  • no storage integration is needed
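Putting the S3-compat points together (bucket, endpoint, and credentials are placeholders): credentials go directly on the stage, since no storage integration is available for S3-compatible storage.

```sql
CREATE STAGE s3compat_stage
  URL = 's3compat://mybucket/files/'
  ENDPOINT = 'storage.example.com'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');
```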

Spark connector

  • offers high speed data transfer and query pushdown from Spark to Snowflake
    • using only the JDBC driver doesn't support query pushdown, and is slow
    • Spark connector analyzes Spark logical plans and converts most of the operators into SQL
  • can use either internal (Snowflake managed) or (now deprecated) external storage (user managed)
  • converts each DF partition to one file in stage. Performance is directly related to number of partitions
  • Converts a DataFrame to CSV files if DF is simple, or to JSON if it has any complex data types (map/array)
  • Small result-sets are read through driver node, whereas larger result-sets are read through executor nodes
  • Arrow is used for reads (v2.6+), stages are used for writes