Function/Proc

  • UDF: SQL, Javascript, Python, Scala, Java or External
  • Java/Scala UDFs are recompiled each time the code is needed if no TARGET_PATH is specified
  • can be secured or regular
  • can be TEMPORARY (though TEMPORARY is not documented for SQL and JS)
  • In JavaScript UDFs, a SQL NULL argument can be checked with V === undefined, whereas a variant/JSON null can be checked with V === null
  • MEMOIZABLE functions support arguments of types VARCHAR, NUMBER, TIMESTAMP, BOOLEAN
  • privileges needed:
    • owner: usage on database, schema; usage or read on external or internal stage respectively; ownership on UDF
    • caller: usage on database, schema; usage or read on external or internal stage respectively; usage on UDF
  • UDF: a SQL NULL translates to Python None, except for VARIANT. For VARIANT, a JSON null translates to Python None, and a SQL NULL translates to a Python object whose is_sql_null returns True
  • The package dependency versions are frozen when the UDF is created. To update to a newer version, recreate the UDF
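The VARIANT NULL distinction above can be sketched in plain Python. SqlNullStub and describe_variant are inventions for local illustration; inside Snowflake, the engine passes its own SQL-NULL object exposing is_sql_null.

```python
# Sketch of NULL handling for a Python UDF with a VARIANT argument.
# SqlNullStub stands in for Snowflake's internal SQL-NULL object.
class SqlNullStub:
    is_sql_null = True

def describe_variant(v):
    # Truthy whether is_sql_null is an attribute or a bound method,
    # and absent on ordinary values.
    if getattr(v, "is_sql_null", False):
        return "SQL NULL"
    if v is None:
        return "JSON null"  # a JSON null arrives as Python None
    return "value"
```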
supports          UDF             SP
use SQL           No (1)          Yes
return value      Scalar/Table    Optional; Scalar (or table for SQL SP)
run as            owner           owner or caller
  1. SQL UDFs can use other SQL expressions
Functions                       SQL    JS    Python    Java    Scala
SQL in body                     Y
Tabular                         Y      Y     Y         Y
Aggregate                                    Y
Read static files from stage                 Y         Y       Y
Memoizable (scalar only)        Y
Code in stage                                Y         Y       Y
Shareable                       Y      Y
Replicable (Inline / All)       Y      Y     I
Clone                           Y      Y     1
  1. If a Python UDF being cloned references a stage, the stage must be outside of the schema being cloned

Python Vectorized UD(T)F

  • The function receives its input as a pandas DataFrame (one row per input row, one column per argument); the output is a pandas Series, NumPy ndarray or Python sequence
  • Optimized for vector operations; the handler must be decorated with the @vectorized decorator
  • Each batch must execute within 180 seconds; use the max_batch_size decorator parameter to limit batch size
  • A UDTF can have either a vectorized end_partition method or a vectorized process method, but not both
    • For a vectorized end_partition, do not define a process method; end_partition accepts a DataFrame as input
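A minimal sketch of a vectorized handler body. Inside Snowflake it would carry the @vectorized(input=pd.DataFrame) decorator from the _snowflake module (only available in the UDF runtime); sum_args is a hypothetical name.

```python
import pandas as pd

# Vectorized UDF handler sketch: called once per batch, not once per row.
def sum_args(df: pd.DataFrame) -> pd.Series:
    # Columns are keyed by argument position: df[0] is the first argument,
    # df[1] the second, for every row in the batch.
    return df[0] + df[1]
```

The return value must have one entry per input row, in order.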

UDTF

Lang       JS                                     Java
init       initialize(argInfo, context)           class constructor (no args)
process    processRow(row, rowWriter, context)    public Stream<OutputRow> process(arg1, ...)
end        finalize(rowWriter, context)           endPartition()
class                                             public static Class getOutputClass()

BP

  • emit output lazily (to conserve memory), using a Java Stream or a Python generator (yield)
  • to maintain per-partition state in a UDTF, initialize it in the constructor
  • to maintain common state across partitions, use Java static variables/static blocks (initialized once per class loader) or Python module-level variables (initialized once per interpreter), e.g. an XML-parser instance reused across the JVM/interpreter
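Both kinds of state can be sketched in a Python UDTF handler. WordCounter is a hypothetical example, not from the notes; Snowflake creates one handler instance per partition, so per-partition state lives in __init__, while module-level objects are shared per interpreter.

```python
import re

# Module-level state: built once per Python interpreter and shared by all
# partitions that interpreter handles (like the parser example above).
WORD_RE = re.compile(r"\w+")

class WordCounter:
    def __init__(self):
        self._partition_total = 0  # per-partition state

    def process(self, text):
        n = len(WORD_RE.findall(text))
        self._partition_total += n
        yield (n,)  # lazy emission via a generator conserves memory

    def end_partition(self):
        yield (self._partition_total,)  # final row for the partition
```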

UDAF

  • Only supported by Python
  • cannot be declared IMMUTABLE and cannot be used as window functions
  • the handler must be a Python class with the following methods
    • __init__: initialize internal state
    • aggregate_state: return state; must be a method with @property decorator
    • accumulate(input_value): update aggregate state with the input row
    • merge(<other-state>): merge two aggregate states
    • finish(<state>): produce final result based on aggregate state
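The handler contract above can be sketched as a running average. PythonAvg is a hypothetical example; Snowflake calls accumulate() per row, may merge() partial states from parallel workers, and calls finish() for the final result.

```python
class PythonAvg:
    def __init__(self):
        # internal aggregate state: running (sum, count)
        self._sum = 0
        self._count = 0

    @property
    def aggregate_state(self):
        # Snowflake reads the partial state through this property
        return (self._sum, self._count)

    def accumulate(self, input_value):
        if input_value is not None:  # aggregates typically skip NULLs
            self._sum += input_value
            self._count += 1

    def merge(self, other_state):
        other_sum, other_count = other_state
        self._sum += other_sum
        self._count += other_count

    def finish(self):
        return self._sum / self._count if self._count else None
```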

IO

Reading

  • static files from stage:
    • Python: local name can be formed using sys._xoptions["snowflake_import_directory"]
    • Java: files are available in UDF's home directory
  • dynamic files from a stage, passed as a parameter of type string:
    • Python: can be read using SnowflakeFile
    • Java: either using SnowflakeFile or InputStream
    • BP: use BUILD_SCOPED_FILE_URL to pass the file name, especially if the caller is not the function owner
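The static-file path lookup for Python can be sketched as follows; the key "snowflake_import_directory" is set by Snowflake's runtime, and the fallback is only so the sketch runs outside Snowflake.

```python
import os
import sys

# Locate a staged file imported into a Python UDF via IMPORTS.
def staged_file_path(filename):
    import_dir = sys._xoptions.get("snowflake_import_directory", "/tmp")
    return os.path.join(import_dir, filename)
```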

Writing

  • allowed only to the /tmp directory, which is created per query
  • for a unique temporary directory, use Java's Files.createTempDirectory, or build a UUID-based path in Python
  • for creating a shared directory, use a Java static block or static variable assignment; in Python, synchronize with fcntl.lockf(fd, fcntl.LOCK_EX)
    • Python UDFs might be running in different processes (because multiple interpreter instances can run concurrently)
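Both patterns can be sketched in Python (function names are inventions for illustration): a UUID-based path for per-process uniqueness, and an exclusive file lock so racing interpreter processes create a shared directory exactly once.

```python
import fcntl
import os
import uuid

BASE = "/tmp"  # the only writable location inside a UDF, created per query

def unique_tmp_dir():
    # A UUID-based path avoids collisions between concurrent processes.
    path = os.path.join(BASE, uuid.uuid4().hex)
    os.makedirs(path)
    return path

def shared_tmp_dir(name):
    # Several interpreter processes may race to create the same directory;
    # an exclusive lock on a sentinel file serializes the creation.
    lock_path = os.path.join(BASE, name + ".lock")
    path = os.path.join(BASE, name)
    with open(lock_path, "w") as fd:
        fcntl.lockf(fd, fcntl.LOCK_EX)
        os.makedirs(path, exist_ok=True)
    return path  # lock released when the sentinel file is closed
```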

External UDF

  • only scalar UDFs are supported
  • REST API endpoints must accept and return JSON values using HTTP POST request
    • AWS supports two types: regional or private
  • can be asynchronous; workflow:
    • the initial POST request returns 202 after validating the request and starting the background work
    • Snowflake then issues GET requests with the same headers as the original POST, with a delay between each request
    • the final GET returns a code other than 202, e.g. a 302 redirect, 200, or an error
  • can be defined as IMMUTABLE (deterministic) or VOLATILE (called every time)
  • cannot be part of a SHARE
  • invoked through a proxy (e.g. AWS API Gateway)
    • Snowflake, the proxy and the actual REST API endpoint can all be in different regions or cloud providers
  • the DDL contains
    • Resource URL in proxy service
    • name of the api integration for authentication
    • function name, i.e. an alias for the REST API
  • Snowflake can send a large number of concurrent requests; any failures are recorded as retries due to transient failure in the query profile
    • remote service can return 429 to indicate Snowflake should scale back sending concurrent requests
    • AWS API gateway provides caching and throttling
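The async POST/GET workflow above can be modeled as a toy protocol. FakeService, post, get and call_external are inventions for illustration, not Snowflake or AWS APIs.

```python
class FakeService:
    """Stand-in remote service: accepts work, finishes after a few polls."""
    def __init__(self, polls_needed=3):
        self._polls_left = polls_needed
        self._result = None

    def post(self, rows):
        # Compute eagerly here for simplicity; pretend it runs in background.
        self._result = [[i, v.upper()] for i, v in rows]
        return 202, None  # work accepted, still running

    def get(self):
        if self._polls_left > 0:
            self._polls_left -= 1
            return 202, None  # still processing
        return 200, self._result  # done

def call_external(service, rows, max_polls=10):
    status, body = service.post(rows)
    if status != 202:
        return body  # synchronous response
    for _ in range(max_polls):  # Snowflake polls with a delay between GETs
        status, body = service.get()
        if status != 202:
            return body  # final response (success or error)
    raise TimeoutError("remote service did not finish")
```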

Data Metric Function

  • Validates data in all types (except Hybrid) of Tables and Views
  • limited to 10,000 DMFs per account
  • System DMFs include {blank,null}_{count,percent}, avg/min/max/stddev, row_count
  • Custom DMF accept table as argument:
    • can accept multiple tables (e.g. referential checking)
  • DMFs are associated with objects in a two-step process: first set a schedule, then associate a DMF with the table/view
    1. set table parameter data_metric_schedule for the object e.g. alter table hr.tables.empl_info set data_metric_schedule = '<schedule>'
      • schedule can be an interval, cron or indicate trigger (TRIGGER_ON_CHANGES)
    2. associate DMFs with the table/view: alter table t add data metric function snowflake.core.null_count on (c1)
  • A DMF can be called manually: select DMF(<query>)
  • Viewing results
    • Dedicated event view/table and function: snowflake.local.data_quality_monitoring_results[_raw]()
      • application roles: snowflake.data_quality_monitoring_{admin, viewer, lookup} allow access to table, view or function
      • also select on the table and usage on DMF
    • Identify individual records that failed data quality checks: system$data_metric_scan
  • DMF can run incrementally, provided certain conditions are met

FLATTEN

parameters:

  • PATH: sub-path within the expression to be expanded
  • OUTER: if TRUE, at least one row is generated even if PATH cannot be accessed, with NULL values for KEY, INDEX and VALUE
  • MODE: OBJECT, ARRAY or BOTH (default)
  • RECURSIVE: expand all sub-elements too (default FALSE); use the metadata column THIS to determine the value being exploded
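The parameter semantics can be modeled in a few lines of Python. This is an illustration, not Snowflake code: flatten yields (key, index, value) tuples mirroring the KEY/INDEX/VALUE output columns.

```python
def flatten(value, mode="BOTH", outer=False, recursive=False):
    """Toy model of FLATTEN: expand objects/arrays into rows."""
    rows = []
    if isinstance(value, dict) and mode in ("OBJECT", "BOTH"):
        rows = [(k, None, v) for k, v in value.items()]
    elif isinstance(value, list) and mode in ("ARRAY", "BOTH"):
        rows = [(None, i, v) for i, v in enumerate(value)]
    if not rows:
        if outer:
            yield (None, None, None)  # OUTER => one row of NULLs
        return
    for key, index, v in rows:
        yield (key, index, v)
        if recursive:
            # RECURSIVE: expand every sub-element as well
            yield from flatten(v, mode=mode, outer=False, recursive=True)
```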

Stored Proc

  • Can use Snowflake Scripting, JavaScript, Snowpark (Python, Scala, Java)
  • Anonymous procedures do not require create schema privilege
    with <name> as procedure(<args>) ....
    call <name> (<args>)
    
  • owner-rights stored-proc limitations:
    • cannot alter session state (e.g. session variables, current database etc)
    • cannot query information_schema table functions that return data based on current_user() (e.g. automatic_clustering_history())
  • a SQL stored procedure can return a table
  • SQL stored procedures are converted to JavaScript internally
  • A transaction started in one scope must end in the same scope. In other words, a stored-procedure call must either be entirely inside a transaction, or the transaction must be entirely inside the stored-procedure call.
  • available Javascript objects:
    • Snowflake: available by default
    • Statement: provides a way to execute SQL
    • ResultSet: results of a query execution
    • SfDate: corresponds to SQL TIMESTAMP data type
  • A nested caller's-rights procedure runs with caller's rights only if all calls in the hierarchy are caller's-rights; an owner's-rights procedure always runs with owner's rights
  • stored procedures can have optional arguments (parameter names defined with DEFAULT clause)