Iceberg Tables

  • Snowflake supports:
    • parquet files
    • hidden partitioning: unlike Hive, supports partitioning on a column expression (transform); the computed partition column is hidden rather than stored as a physical column
    • cross-platform/region external volumes: for Snowflake managed iceberg tables, Snowflake will bill data transfer fees
  • V1 and V2 supported, except:
    • Row-level deletes
    • history.expire.min-snapshots-to-keep table property
    • tables with a partition spec that refers to an identity transform when the source column doesn't exist in the Parquet file
    • data type unsigned int
    • temporary/transient tables, cloning, replication, sharing
    • clustering tables defined using catalog integration
  • V2 format offers copy-on-write or merge-on-read strategies for updates and deletes
    • copy-on-write: rewrites the affected data files. Best suited for frequent reads with large but infrequent updates
    • merge-on-read: (currently unsupported) writes a delete file instead of rewriting data files. Best suited for frequent writes/updates
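A minimal sketch of creating a Snowflake-managed Iceberg table on an external volume (the names my_vol and my_tbl are hypothetical):

```sql
-- Hypothetical names; sketch of a Snowflake-managed Iceberg table.
CREATE ICEBERG TABLE my_tbl (
  id INT,
  event_date DATE
)
  CATALOG = 'SNOWFLAKE'          -- catalog managed by Snowflake
  EXTERNAL_VOLUME = 'my_vol'     -- where Parquet data/metadata files are written
  BASE_LOCATION = 'my_tbl';      -- directory prefix under the volume's storage base URL
```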

Storage

  • Uses external volume
  • base_location_prefix can be set at account/database/schema levels
  • file paths: storage_base_url/<table-path>/[data | metadata]/

    base_location | base_location_prefix | table-path
    Yes           | *                    | base_location.randomId
    -             | Yes                  | base_location_prefix/<table>.randomId
    -             | -                    | <database>/<schema>/<table_name>.randomId
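A sketch of the external volume that supplies storage_base_url in the paths above (bucket name and role ARN are hypothetical). A table created without BASE_LOCATION would then land under <database>/<schema>/<table_name>.randomId beneath this base URL:

```sql
-- Hypothetical names/ARN; S3 external volume for Iceberg tables.
CREATE EXTERNAL VOLUME my_vol
  STORAGE_LOCATIONS = (
    (
      NAME = 'us-east-1'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://my-bucket/iceberg/'   -- becomes storage_base_url
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-iceberg-role'
    )
  );
```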

Catalog

  • Catalog can be managed by Snowflake or externally (e.g. AWS Glue) using Catalog Integration
  • For unmanaged Iceberg tables, DATA_RETENTION_TIME_IN_DAYS is set to the smaller of history.expire.max-snapshot-age-ms and 1 day (Standard Edition) or 5 days (Enterprise Edition)

Catalog Integration

  • For creating Iceberg tables that are managed externally. The required catalog_source parameter selects one of the following types:
    1. AWS Glue (glue)
    2. Object Storage (object_storage)
    3. Snowflake Open Catalog (polaris)
    4. Iceberg REST (iceberg_rest)
    5. SAP Business Data Cloud (sap_bdc)
  • table_format is always iceberg, except for object-storage catalogs, which also support delta
  • Created with create catalog integration …; the available options depend on the catalog source
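A sketch of a Glue catalog integration and an externally managed table that uses it (all names, the role ARN, and the account ID are hypothetical):

```sql
-- Hypothetical names; catalog integration with catalog_source = GLUE.
CREATE CATALOG INTEGRATION my_glue_int
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = 'my_glue_db'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-glue-role'
  GLUE_CATALOG_ID = '123456789012'
  ENABLED = TRUE;

-- Unmanaged Iceberg table resolved through that integration.
CREATE ICEBERG TABLE my_ext_tbl
  CATALOG = 'my_glue_int'
  EXTERNAL_VOLUME = 'my_vol'
  CATALOG_TABLE_NAME = 'my_ext_tbl';  -- name as registered in Glue
```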

Catalog-Linked Database (CLD)

  • A database that is linked to a catalog integration, and optionally an external volume if catalog integration does not support vended credentials
    • nested namespaces can be ignored or flattened
  • A task runs every 5 minutes to discover newly added schemas and tables.
    • does not yet support dropped or renamed schemas/tables
  • All discovered schemas and tables are owned by the same role that owns CLD
  • For a write-enabled CLD, dropping a table is committed to the catalog, so the table is also dropped in the Iceberg catalog
    • SYSTEM$UNLINK_TABLE can be used to remove a table from the CLD without dropping it in the Iceberg catalog
  • Use SYSTEM$CATALOG_LINK_STATUS to check for any failing sync
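A sketch of creating a CLD and checking its sync status (names are hypothetical, and the exact LINKED_CATALOG options may vary by catalog integration type):

```sql
-- Hypothetical names; database linked to a catalog integration.
CREATE DATABASE my_cld
  LINKED_CATALOG = (
    CATALOG = 'my_rest_int'      -- the catalog integration to link
  )
  EXTERNAL_VOLUME = 'my_vol';    -- only needed if the integration lacks vended credentials

-- Check for failing syncs on the link.
SELECT SYSTEM$CATALOG_LINK_STATUS('my_cld');
```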

Architecture

  • Compared to Hive, which is directory based (all files in a directory, which is also a partition), Iceberg's metadata contains a list of files. This allows:
    • better performance on cloud storage by tracking individual files instead of directories
      • files can have different prefixes
      • costly directory listings are avoided
    • transactional abilities, directories are managed by filesystem and aren't transactional
    • time-travel (iceberg supports snapshots)
    • can support compaction via background process
  • Iceberg table consists of 3 layers
    • catalog: contains table and its current metadata-file
    • metadata:
      • metadata-file: contains schema, partition information, list of snapshots (manifest list), and which snapshot is current
      • manifest list: A list of all manifest files that make up a snapshot
      • manifest file: tracks a subset of the data files along with details (e.g. file format such as Parquet) and statistics (row count, null counts, min/max values, etc.)
    • data: typically parquet files
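The catalog layer above can be inspected from Snowflake: SYSTEM$GET_ICEBERG_TABLE_INFORMATION returns the current metadata-file location and snapshot for a table (my_tbl is a hypothetical name):

```sql
-- Returns JSON with the current metadata-file location and snapshot id.
SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('my_tbl');
```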