Features
Time-travel
data retention can be specified at account, database, schema or table level; the most granular setting wins
Account parameter MIN_DATA_RETENTION_TIME_IN_DAYS sets the minimum effective time-travel period
Increasing/decreasing the time-travel period affects data already in time-travel, not data already in fail-safe
Reducing the time-travel period does not take effect immediately; a background process must pick up the change
not supported on shared data
Specified using AT|BEFORE
Timestamp values are specified as strings; unless explicitly cast, they are treated as TIMESTAMP_NTZ (interpreted as UTC)
FROM clause can also specify CHANGES in which case changes made since AT|BEFORE are returned
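A minimal sketch of AT|BEFORE usage (table name, timestamp, and statement ID are illustrative):

```sql
-- query a table as of a point in time (string literal is TIMESTAMP_NTZ unless cast)
SELECT * FROM orders AT(TIMESTAMP => '2024-06-01 12:00:00'::timestamp_ltz);

-- query as of a relative offset (seconds before the current time)
SELECT * FROM orders AT(OFFSET => -60*5);

-- query the state just before a specific statement ran
SELECT * FROM orders BEFORE(STATEMENT => '<query_id>');
```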
Data Clean Room
Provider needs an Enterprise edition or higher; consumer can be a non-Snowflake user
One environment per account that supports multiple clean rooms
Uses differential privacy: mathematical techniques that add noise to hide the originating data
can be managed using Web UI or API
steps:
Provider creates a clean room
imports tables/views, SQL templates, or Python code
join policies: which columns are allowed for joining
segmentation columns: analyst can create segments only from the specified columns (e.g. income_bracket, age_band)
activation columns: results can be enriched only with data coming from the activation columns
invites a consumer, who imports their own tables/views (collaborator tables)
consumer applies filters, segments
consumer activates the result, i.e. sends data to the provider for enrichment
consumer can download the enriched results
Setup
Sign up for DCR
Add IP address of the DCR Web App to network policy
Configure Snowflake account :
create a service account
Install Snowflake native app (SAMOOHA_BY_SNOWFLAKE)
Register databases that need to be shared
only tables are supported
Unstructured data
stored in internal/external stages, using directory tables
Access using URLs
for URLs to work, choose server side encryption for stages
Can use REST API GET /api/files/ to download data from an internal/external stage
Java, Scala and Python UDFs/SPs can read from URLs dynamically
Enable SSE encryption for an internal stage to allow access to unstructured data
| Property | Scoped URL | File URL | Presigned URL |
| --- | --- | --- | --- |
| Expiry | result cache (24 hr) | permanent | expiry time argument |
| URL type | encoded | path | encoded |
| Generate | BUILD_SCOPED_FILE_URL | BUILD_STAGE_FILE_URL | GET_PRESIGNED_URL |
| Permission | user who generated it | access to stage | anyone |
| Usage | REST API/Snowsight | REST API/Snowsight | within browser |
| Data Sharing | Yes | No | Yes |
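The three URL-generating functions side by side (stage and file path are illustrative):

```sql
-- scoped URL: expires with the query result cache (~24 hr)
SELECT BUILD_SCOPED_FILE_URL(@my_int_stage, 'docs/report.pdf');

-- file URL: permanent; the caller needs privileges on the stage to use it
SELECT BUILD_STAGE_FILE_URL(@my_int_stage, 'docs/report.pdf');

-- presigned URL: usable by anyone until the expiry (in seconds) elapses
SELECT GET_PRESIGNED_URL(@my_int_stage, 'docs/report.pdf', 3600);
```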
Schema Detection/Evolution
Schema detection
table function infer_schema(location => '@<stage>', file_format => 'myfmt') returns the list of detected columns
generate_column_description(<array of column descriptions>, '<type>') takes array_agg(object_construct(*)) over the infer_schema output and returns column DDL
create table ... using template (<query-expr>) uses the output of the above query to create the table
for JSON and CSV files only, entire file is scanned by default to detect schema
also works at folder level by returning all columns across all files
supported when loading avro, parquet, orc, json and csv data
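The three schema-detection steps above, sketched end to end (stage, file format and table names are illustrative):

```sql
-- 1. inspect the detected columns
SELECT * FROM TABLE(
  INFER_SCHEMA(LOCATION => '@my_stage', FILE_FORMAT => 'myfmt'));

-- 2. turn the detection output into column DDL text
SELECT GENERATE_COLUMN_DESCRIPTION(ARRAY_AGG(OBJECT_CONSTRUCT(*)), 'table')
FROM TABLE(
  INFER_SCHEMA(LOCATION => '@my_stage', FILE_FORMAT => 'myfmt'));

-- 3. create the table directly from the detected schema
CREATE TABLE my_table USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(
    INFER_SCHEMA(LOCATION => '@my_stage', FILE_FORMAT => 'myfmt')));
```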
stand-alone feature, but can be used with schema detection
To enable:
table must be created with enable_schema_evolution = true
COPY INTO must specify match_by_column_name (for CSV set parse_header and set error_on_column_count_mismatch to false)
role that loads data must have the EVOLVE SCHEMA privilege (or OWNERSHIP) on the table
limitations
does not drop columns; if a column is missing from new files, its NOT NULL constraint is dropped (it becomes nullable)
per run: max 10 new columns added and at most 1 schema evolved per COPY INTO
Not supported by Snowpipe streaming SDK; but supported by Kafka Connector with Snowpipe Streaming
create/alter table set enable_schema_evolution = true
when loading multiple files with different schemas, COPY INTO evolves the schema using one file at a time
keep re-running COPY INTO until it succeeds, by which point the schema has evolved for all files
schema changes are not rolled back even if copy into fails
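A minimal schema-evolution sketch for CSV loads (table, stage and column names are illustrative):

```sql
-- table opts in to schema evolution
CREATE TABLE landing (id INT) ENABLE_SCHEMA_EVOLUTION = TRUE;

-- loads may add new columns found in the files; CSV needs PARSE_HEADER
-- and ERROR_ON_COLUMN_COUNT_MISMATCH turned off
COPY INTO landing
  FROM @my_stage
  FILE_FORMAT = (TYPE = CSV PARSE_HEADER = TRUE
                 ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```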
Native Apps
A native application consists of a manifest (manifest.yml, YAML format), setup scripts (SQL), and business logic (resource files) that gets installed on consumer accounts
business logic can be written using Streamlit, stored procs, UDFs, the Snowflake API, JavaScript and SQL
setup script installs in a local schema: tables, views, UDFs, procs, app roles
can use SQL, Python or Java
can bundle proprietary data (tables and stages)
consumers do not have access to internal data
apps can access external data
versioned
can be included in a listing (free/paid, private/marketplace)
Provider creates application package that includes data, application logic, metadata, setup script and versions/patch levels
Consumer installs the application from the listing
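Setup scripts are plain SQL run in the consumer account at install time; a minimal sketch (schema, role and procedure names are illustrative):

```sql
-- setup_script.sql: runs automatically when the app is installed/upgraded
CREATE APPLICATION ROLE IF NOT EXISTS app_user;
CREATE OR ALTER VERSIONED SCHEMA core;
GRANT USAGE ON SCHEMA core TO APPLICATION ROLE app_user;

CREATE OR REPLACE PROCEDURE core.hello()
  RETURNS STRING
  LANGUAGE SQL
  AS $$ BEGIN RETURN 'hello from the app'; END; $$;
GRANT USAGE ON PROCEDURE core.hello() TO APPLICATION ROLE app_user;
```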
Query Acceleration
Works for queries that are deterministic, scan a large amount of data, and return a small result set, e.g.
large scans with highly selective filters, or aggregations expected to produce a low-cardinality result
alter warehouse foo_wh set enable_query_acceleration = true, query_acceleration_max_scale_factor = 10
scale factor is omitted or 0 => unbounded, queries can lease as many resources as needed and available
regardless of scale factor, query uses only the resources needed (if available)
e.g. queries running on a Medium WH (4 nodes) can use up to 4 × 10 = 40 nodes
common reasons for ineligibility:
low partition count
not selective enough (e.g. group by has high cardinality)
non-deterministic functions, or LIMIT without an ORDER BY
supports: select, insert, ctas and copy into <table>
use SYSTEM$ESTIMATE_QUERY_ACCELERATION(<query_id>) or the QUERY_ACCELERATION_ELIGIBLE view
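Both eligibility checks sketched (the query ID placeholder is illustrative):

```sql
-- estimate whether a specific past query would have benefited
SELECT PARSE_JSON(SYSTEM$ESTIMATE_QUERY_ACCELERATION('<query_id>'));

-- find historically eligible queries, most impactful first
SELECT query_id, eligible_query_acceleration_time
FROM snowflake.account_usage.query_acceleration_eligible
ORDER BY eligible_query_acceleration_time DESC;
```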