Function/Proc¶

UDF: SQL, Javascript, Python, Scala, Java or External
Java/Scala UDFs are compiled each time if no TARGET_PATH specified
can be secured or regular
can be TEMPORARY, but SQL and JS are not documented
A SQL NULL can be checked with V === undefined whereas a variant/JSON null can be checked with V === null
MEMOIZABLE functions support arguments of types VARCHAR, NUMBER, TIMESTAMP, BOOLEAN
privileges needed:
owner: usage on database, schema; usage or read on external or internal stage respectively; ownership on UDF
caller: usage on database, schema; usage or read on external or internal stage respectively; usage on UDF
UDF: SQL NULL translates to Python None except for Variant. For variant, JSON null translates to Python None, and SQL NULL translates to a Python object which has a method is_sql_null which returns True
The package dependency versions are frozen when the UDF is created. To update to a newer version, recreate the UDF

supports	UDF	SP
use SQL	No ¹	Yes
Return value	Scalar/Table	Optional, Scalar (or table for SQL SP)
run as	owner	owner or caller

Functions	SQL	JS	Python	Java	Scala
SQL	Y
Tabular	Y	Y	Y	Y
Aggregate			Y
Read static files from stage			Y	Y	Y
Memoizable - Scalar only	Y
Code in stage?			Y	Y	Y
Shareable	Y	Y
Replicable - Inline / All	Y	Y	I
Clone	Y	Y	1

If Python UDF being cloned references a stage, the stage must be outside of the schema being cloned

Python Vectorized UD(T)F¶

Function receives input as pandas data-frame each row containing arguments and output is also a pandas series/numpy ndarray/Python sequence.
Optimized for vector operations; UDF must be decorated with @vectorized decorator
A UDF must execute within 180 seconds, use max_batch_size decorator parameter to limit batch size
A UDTF can have either vectorized end_partition method or vectorized process method, bot not both
For vectorized end_partition, do not define process method, end_partition accepts DataFrame as input

Lang	JS	Java
init	`initialize(argInfo, context)`	class constr no args
process	`processRow(row, rowWriter, context)`	`public Stream<OutputRow> process(arg1,...)`
end	`finalize(rowWriter, context)`	`endPartition()`
class		`public static Class getOutputClass()`

emit output lazily (to conserve memory), using a java Stream or Python Sequence (yield)
to maintain per partition state for UDTF, initialize in the constructor
to maintain common state across partitions, use static variables/block to initialize per class loader, or Python module level variables for per Python interpreter (e.g. creating an instance of XML parser class to be used across JVM/interpreter)

static files from stage:
Python: local name can be formed using sys._xoptions["snowflake_import_directory"]
Java: files are available in UDF's home directory
dynamic files from stage passed as parameter; parameter is of type string;
Python: can be read using SnowflakeFile
Java: Either using SnowflakeFile or InputStream
BP: use build_scoped_file_url to pass file name especially if caller is not the function owner

allowed only to /tmp directory, which is created per query
for unique temporary directory, use Java Files.createTemporaryDirectory or use UUID for Python
for creating shared directory, use java static block or variable assignment; or for Python use fcntl.lockf(fd, fcntl.LOCK_EX) (example)
Python UDFs might be running in different processes (because of multiple interpreter instances running)

only scalar UDFs are supported
REST API endpoints must accept and return JSON values using HTTP POST request
AWS supports two types: regional or private
can be async, workflow:
the initial POST request returns 202 after validation and starting background work process
Snowflake GET request with the same headers as the original POST with delay between each request
the final GET request returns code other than 202, such as, 302 redirect, 200 or error
can be defined as IMMUTABLE (deterministic) or VOLATILE (called every time)
cannot be part of a SHARE
invoked through proxy (eg AWS gateway API)
Snowflake, proxy and the actual REST API endpoint all can be in different region or cloud provides
the DDL contains
Resource URL in proxy service
name of the api integration for authentication
function name, i.e. an alias for the REST API
Snowflake can send large number of requests, any failures are recorded as retries due to transient failure in query profile
remote service can return 429 to indicate Snowflake should scale back sending concurrent requests
AWS API gateway provides caching and throttling

Validates data in all types (except Hybrid) of Tables and Views
limited to 10,000 DMFs per account
System DMFs include {blank,null}_{count,percent}, avg/min/max/stddev, row_count
Custom DMF accept table as argument:
can accept multiple tables (e.g. referential checking)
DMFs can be associated with objects by a two step process, first setting a schedule and then associate a DMF with table/view
set table parameter data_metric_schedule for the object e.g. alter table hr.tables.empl_info set data_metric_schedule = '<schedule>'
schedule can be an interval, cron or indicate trigger (TRIGGER_ON_CHANGES)
associate DMFs with the table/view: alter table t add data metric function snowflake.core.null_count on (c1)
A DMF can be called manually: select DMF(<query>)
Viewing results
Dedicated event view/table and function: snowflake.local.data_quality_monitoring_results[_raw]()
- application roles: snowflake.data_quality_monitoring_{admin, viewer, lookup} allow access to table, view or function
- also select on the table and usage on DMF
Identify individual records that failed data quality checks: system$data_metric_scan
DMF can run incrementally, provided certain conditions are met

parameters:

PATH: sub-path within the expression to be expanded
OUTER: if true at least one row is generated even if PATH cannot be accessed. NULL values for KEY, INDEX, VALUE
MODE: OBJECT, ARRAY or BOTH (default)
RECURSIVE: expand all sub-elements too (default false); use metadata columns THIS to determine the value being exploded

with <name> as procedure(<args>) ....
call <name> (<args>)

owner-rights stored-proc limitations:
cannot alter session state (e.g. session variables, current database etc)
cannot query information_schema table functions that return data based on current_user() (e.g. automatic_clustering_history())
SQL Stored-procedure can return a table
SQL stored-procedures are converted to javascript internally
A transaction started in one scope must end in the same scope. IOW, A stored-procedure call either must be inside a transaction, or the transaction must be inside the stored-procedure call.
available Javascript objects:
Snowflake: available by default
Statement: provides a way to execute SQL
ResultSet: results of a query execution
SfDate: corresponds to SQL TIMESTAMP data type
A nested caller right proc is called as caller right iff all calls in hierarchy are caller right. An owner right sp is always called as owner right
stored procedures can have optional arguments (parameter names defined with DEFAULT clause)