Function/Proc
- UDF: SQL, Javascript, Python, Scala, Java or External
- Java/Scala UDFs are compiled each time if no
TARGET_PATH specified
- can be secured or regular
- can be
TEMPORARY, but SQL and JS are not documented
- A SQL
NULL can be checked with V === undefined whereas a variant/JSON null can be checked with V === null
MEMOIZABLE functions support arguments of types VARCHAR, NUMBER, TIMESTAMP, BOOLEAN
- privileges needed:
- owner:
usage on database, schema; usage or read on external or internal stage respectively; ownership on UDF
- caller:
usage on database, schema; usage or read on external or internal stage respectively; usage on UDF
- UDF: SQL
NULL translates to Python None except for Variant. For variant, JSON null translates to Python None, and SQL NULL translates to a Python object which has a method is_sql_null which returns True
- The package dependency versions are frozen when the UDF is created. To update to a newer version, recreate the UDF
| supports |
UDF |
SP |
| use SQL |
No |
Yes |
| Return value |
Scalar/Table |
Optional, Scalar (or table for SQL SP) |
| run as |
owner |
owner or caller |
| Functions |
SQL |
JS |
Python |
Java |
Scala |
| SQL |
Y |
|
|
|
|
| Tabular |
Y |
Y |
Y |
Y |
|
| Aggregate |
|
|
Y |
|
|
| Read static files from stage |
|
|
Y |
Y |
Y |
| Memoizable - Scalar only |
Y |
|
|
|
|
| Code in stage? |
|
|
Y |
Y |
Y |
| Shareable |
Y |
Y |
|
|
|
| Replicable - Inline / All |
Y |
Y |
I |
|
|
| Clone |
Y |
Y |
1 |
|
|
- If Python UDF being cloned references a stage, the stage must be outside of the schema being cloned
Python Vectorized UD(T)F
- Function receives input as pandas data-frame each row containing arguments and output is also a pandas series/numpy
ndarray/Python sequence.
- Optimized for vector operations; UDF must be decorated with
@vectorized decorator
- A UDF must execute within 180 seconds, use
max_batch_size decorator parameter to limit batch size
- A UDTF can have either vectorized
end_partition method or vectorized process method, bot not both
- For vectorized
end_partition, do not define process method, end_partition accepts DataFrame as input
UDTF
| Lang |
JS |
Java |
| init |
initialize(argInfo, context) |
class constr no args |
| process |
processRow(row, rowWriter, context) |
public Stream<OutputRow> process(arg1,...) |
| end |
finalize(rowWriter, context) |
endPartition() |
| class |
|
public static Class getOutputClass() |
BP
- emit output lazily (to conserve memory), using a java Stream or Python Sequence (
yield)
- to maintain per partition state for UDTF, initialize in the constructor
- to maintain common state across partitions, use static variables/block to initialize per class loader, or Python module level variables for per Python interpreter (e.g. creating an instance of XML parser class to be used across JVM/interpreter)
UDAF
- Only supported by Python
- can't be immutable, can't use as window functions
- handler must be a Python class with following methods
__init__: initialize internal state
aggregate_state: return state; must be a method with @property decorator
accumulate(input_value): update aggregate state with the input row
merge(<other-state>): merge two aggregate states
finish(<state>): produce final result based on aggregate state
IO
Reading
- static files from stage:
- Python: local name can be formed using
sys._xoptions["snowflake_import_directory"]
- Java: files are available in UDF's home directory
- dynamic files from stage passed as parameter; parameter is of type
string;
- Python: can be read using
SnowflakeFile
- Java: Either using
SnowflakeFile or InputStream
- BP: use
build_scoped_file_url to pass file name especially if caller is not the function owner
Writing
- allowed only to
/tmp directory, which is created per query
- for unique temporary directory, use Java
Files.createTemporaryDirectory or use UUID for Python
- for creating shared directory, use java static block or variable assignment; or for Python use
fcntl.lockf(fd, fcntl.LOCK_EX) (example)
- Python UDFs might be running in different processes (because of multiple interpreter instances running)
External UDF
- only scalar UDFs are supported
- REST API endpoints must accept and return JSON values using HTTP
POST request
- AWS supports two types: regional or private
- can be async, workflow:
- the initial
POST request returns 202 after validation and starting background work process
- Snowflake
GET request with the same headers as the original POST with delay between each request
- the final
GET request returns code other than 202, such as, 302 redirect, 200 or error
- can be defined as
IMMUTABLE (deterministic) or VOLATILE (called every time)
- cannot be part of a
SHARE
- invoked through proxy (eg AWS gateway API)
- Snowflake, proxy and the actual REST API endpoint all can be in different region or cloud provides
- the DDL contains
- Resource URL in proxy service
- name of the
api integration for authentication
- function name, i.e. an alias for the REST API
- Snowflake can send large number of requests, any failures are recorded as retries due to transient failure in query profile
- remote service can return 429 to indicate Snowflake should scale back sending concurrent requests
- AWS API gateway provides caching and throttling
Data Metric Function
- Validates data in all types (except Hybrid) of Tables and Views
- limited to 10,000 DMFs per account
- System DMFs include
{blank,null}_{count,percent}, avg/min/max/stddev, row_count
- Custom DMF accept table as argument:
- can accept multiple tables (e.g. referential checking)
- DMFs can be associated with objects by a two step process, first setting a schedule and then associate a DMF with table/view
- set table parameter
data_metric_schedule for the object e.g. alter table hr.tables.empl_info set data_metric_schedule = '<schedule>'
- schedule can be an interval, cron or indicate trigger (
TRIGGER_ON_CHANGES)
- associate DMFs with the table/view:
alter table t add data metric function snowflake.core.null_count on (c1)
- A DMF can be called manually:
select DMF(<query>)
- Viewing results
- Dedicated event view/table and function:
snowflake.local.data_quality_monitoring_results[_raw]()
- application roles:
snowflake.data_quality_monitoring_{admin, viewer, lookup} allow access to table, view or function
- also
select on the table and usage on DMF
- Identify individual records that failed data quality checks:
system$data_metric_scan
- DMF can run incrementally, provided certain conditions are met
FLATTEN
parameters:
PATH: sub-path within the expression to be expanded
OUTER: if true at least one row is generated even if PATH cannot be accessed. NULL values for KEY, INDEX, VALUE
MODE: OBJECT, ARRAY or BOTH (default)
RECURSIVE: expand all sub-elements too (default false); use metadata columns THIS to determine the value being exploded
Stored Proc
- Can use Snowflake Scripting, JavaScript, Snowpark (Python, Scala, Java)
- Anonymous procedures do not require
create schema privilege
with <name> as procedure(<args>) ....
call <name> (<args>)
- owner-rights stored-proc limitations:
- cannot alter session state (e.g. session variables, current database etc)
- cannot query
information_schema table functions that return data based on current_user() (e.g. automatic_clustering_history())
- SQL Stored-procedure can return a table
- SQL stored-procedures are converted to javascript internally
- A transaction started in one scope must end in the same scope. IOW, A stored-procedure call either must be inside a transaction, or the transaction must be inside the stored-procedure call.
- available Javascript objects:
Snowflake: available by default
Statement: provides a way to execute SQL
ResultSet: results of a query execution
SfDate: corresponds to SQL TIMESTAMP data type
- A nested caller right proc is called as caller right iff all calls in hierarchy are caller right. An owner right sp is always called as owner right
- stored procedures can have optional arguments (parameter names defined with
DEFAULT clause)