# Function/Proc
- UDF languages: SQL, JavaScript, Python, Scala, Java, or External
- Java/Scala UDFs are compiled each time if no `TARGET_PATH` is specified
- can be secure or regular
- can be `TEMPORARY`, but this is not documented for SQL and JS
- JS: a SQL `NULL` can be checked with `V === undefined`, whereas a variant/JSON `null` can be checked with `V === null`
- `MEMOIZABLE` functions support arguments of types `VARCHAR`, `NUMBER`, `TIMESTAMP`, `BOOLEAN`
- privileges needed:
  - owner: `usage` on database and schema; `usage` or `read` on external or internal stage respectively; `ownership` on the UDF
  - caller: `usage` on database and schema; `usage` or `read` on external or internal stage respectively; `usage` on the UDF
- Python: SQL `NULL` translates to Python `None`, except for variant. For variant, JSON `null` translates to Python `None`, and SQL `NULL` translates to a Python object whose `is_sql_null` method returns `True`
- package dependency versions are frozen when the UDF is created; to pick up a newer version, recreate the UDF
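The variant `NULL` handling above can be modeled locally. This is a sketch only: `SqlNull` is a hypothetical stand-in for the object Snowflake actually passes for SQL `NULL` in a variant argument; only its `is_sql_null()` method mirrors the documented behavior.

```python
class SqlNull:
    """Hypothetical stand-in for the object Snowflake passes for SQL NULL
    inside a VARIANT argument."""
    def is_sql_null(self):
        return True

def describe_variant(v):
    # JSON null arrives in the handler as Python None
    if v is None:
        return "json-null"
    # SQL NULL arrives as an object whose is_sql_null() returns True
    if hasattr(v, "is_sql_null") and v.is_sql_null():
        return "sql-null"
    return "value"
```

A handler can thus tell apart a value, a JSON `null`, and a SQL `NULL` in the same variant column.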
| supports | UDF | SP |
|---|---|---|
| use SQL | No (1) | Yes |
| Return value | Scalar/Table | Optional, Scalar (or table for SQL SP) |
| run as | owner | owner or caller |
- SQL UDFs can use other SQL expressions
| Functions | SQL | JS | Python | Java | Scala |
|---|---|---|---|---|---|
| SQL | Y | | | | |
| Tabular | Y | Y | Y | Y | |
| Aggregate | | | Y | | |
| Read static files from stage | | | Y | Y | Y |
| Memoizable - Scalar only | Y | | | | |
| Code in stage? | | | Y | Y | Y |
| Shareable | Y | Y | | | |
| Replicable - Inline / All | Y | Y | I | | |
| Clone | Y | Y | (1) | | |
- (1) If a Python UDF being cloned references a stage, the stage must be outside of the schema being cloned
# Python Vectorized UD(T)F
- the function receives its input as a pandas DataFrame, with each row containing the arguments for one call; the output is a pandas Series / numpy `ndarray` / Python sequence
- optimized for vector operations; the UDF must be decorated with the `@vectorized` decorator
- a UDF must execute within 180 seconds; use the `max_batch_size` decorator parameter to limit the batch size
- a UDTF can have either a vectorized `end_partition` method or a vectorized `process` method, but not both
  - for vectorized `end_partition`, do not define a `process` method; `end_partition` accepts a `DataFrame` as input
  - for vectorized `process`, the method receives a `DataFrame` of input rows
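A minimal sketch of a vectorized-`end_partition` handler, assuming pandas. The `@vectorized` decorator (from Snowflake's `_snowflake` module) and registration DDL are omitted so the logic runs locally; the class and column names are illustrative.

```python
import pandas as pd

class SumPerPartition:
    """Sketch of a vectorized UDTF handler: no process method is defined,
    and end_partition receives the whole partition as a pandas DataFrame."""

    def end_partition(self, df: pd.DataFrame) -> pd.DataFrame:
        # Collapse the partition into one output row per input column,
        # returning the per-column totals.
        return pd.DataFrame({"column": df.columns, "total": df.sum().values})
```

In Snowflake the returned `DataFrame` becomes the output rows for that partition.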
# UDTF
| Lang | JS | Java |
|---|---|---|
| init | `initialize(argInfo, context)` | class constructor, no args |
| process | `processRow(row, rowWriter, context)` | `public Stream<OutputRow> process(arg1, ...)` |
| end | `finalize(rowWriter, context)` | `endPartition()` |
| class | | `public static Class getOutputClass()` |
# BP
- emit output lazily (to conserve memory), using a Java `Stream` or a Python generator (`yield`)
- to maintain per-partition state in a UDTF, initialize it in the constructor
- to maintain common state across partitions, use Java static variables/blocks (initialized once per class loader) or Python module-level variables (initialized once per interpreter), e.g. creating one instance of an XML parser class to be reused across the JVM/interpreter
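The two kinds of state above can be sketched in a plain Python UDTF-style handler (names are illustrative; a compiled regex stands in for the shared XML-parser example):

```python
import re

# Module-level: created once per Python interpreter, shared by all
# partitions that interpreter processes.
SHARED_PATTERN = re.compile(r"\d+")

class ExtractNumbers:
    def __init__(self):
        # Per-partition state: a fresh handler instance is used per partition.
        self.count = 0

    def process(self, text):
        # Emit output rows lazily with a generator instead of building a list.
        for match in SHARED_PATTERN.findall(text):
            self.count += 1
            yield (int(match),)

    def end_partition(self):
        # Flush per-partition state as a final row.
        yield (self.count,)
```

The constructor state is safe because Snowflake uses one handler instance per partition, while module-level objects persist across partitions in the same interpreter.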
# UDAF
- Only supported by Python
- cannot be declared `IMMUTABLE`; cannot be used as window functions
- the handler must be a Python class with the following methods:
  - `__init__`: initialize internal state
  - `aggregate_state`: return the state; must be a method with the `@property` decorator
  - `accumulate(input_value)`: update the aggregate state with the input row
  - `merge(other_state)`: merge two aggregate states
  - `finish()`: produce the final result based on the aggregate state
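A minimal handler implementing the methods above for an integer sum; this runs locally and is only a sketch of the handler shape, not registered with Snowflake:

```python
class SumHandler:
    """UDAF-style handler: sums non-NULL integer inputs."""

    def __init__(self):
        self._sum = 0                 # internal aggregate state

    @property
    def aggregate_state(self):
        # State handed to merge() on another instance (partial aggregate).
        return self._sum

    def accumulate(self, input_value):
        # Update state with one input row; ignore SQL NULLs (None).
        if input_value is not None:
            self._sum += input_value

    def merge(self, other_state):
        # Combine a partial aggregate computed elsewhere into this one.
        self._sum += other_state

    def finish(self):
        # Produce the final result from the aggregate state.
        return self._sum
```

Snowflake may run several instances in parallel, accumulate into each, then merge their `aggregate_state`s before calling `finish()`.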
# IO
## Reading
- static files from stage:
  - Python: the local path can be formed using `sys._xoptions["snowflake_import_directory"]`
  - Java: files are available in the UDF's home directory
- dynamic files from stage, passed as a parameter of type `string`:
  - Python: can be read using `SnowflakeFile`
  - Java: either using `SnowflakeFile` or an `InputStream`
  - BP: use `build_scoped_file_url` to pass the file name, especially if the caller is not the function owner
## Writing
- allowed only to the `/tmp` directory, which is created per query
- for a unique temporary directory, use Java `Files.createTempDirectory` or a UUID-based name in Python
- for creating a shared directory, use a Java static block or variable assignment; in Python use `fcntl.lockf(fd, fcntl.LOCK_EX)`
  - Python UDFs might be running in different processes (because multiple interpreter instances may run)
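Both Python patterns above can be sketched with the standard library (Unix only, since `fcntl` is POSIX); the directory and lock-file names are illustrative:

```python
import fcntl
import os
import uuid

def unique_tmp_dir():
    # A UUID-based name avoids collisions when several interpreter
    # processes serve the same query.
    path = os.path.join("/tmp", uuid.uuid4().hex)
    os.makedirs(path)
    return path

def ensure_shared_dir(path="/tmp/shared_example_dir",
                      lock_file="/tmp/shared_example_dir.lock"):
    # Serialize creation of a directory shared across processes with an
    # exclusive advisory lock, as the fcntl.lockf pattern suggests.
    with open(lock_file, "w") as fd:
        fcntl.lockf(fd, fcntl.LOCK_EX)
        try:
            if not os.path.isdir(path):
                os.makedirs(path)
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
    return path
```

The lock makes `ensure_shared_dir` safe to call concurrently from multiple processes; each sees the directory created exactly once.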
# External UDF
- only scalar UDFs are supported
- REST API endpoints must accept and return JSON values via HTTP `POST` requests
- AWS supports two endpoint types: regional or private
- can be async; workflow:
  - the initial `POST` request returns 202 after validating and starting the background work process
  - Snowflake issues `GET` requests with the same headers as the original `POST`, with a delay between requests
  - the final `GET` request returns a code other than `202`, such as a `302` redirect, `200`, or an error
- can be defined as `IMMUTABLE` (deterministic) or `VOLATILE` (called every time)
- cannot be part of a `SHARE`
- invoked through a proxy (e.g. AWS API Gateway)
- Snowflake, the proxy, and the actual REST API endpoint can each be in a different region or cloud provider
- the DDL contains:
  - the resource URL in the proxy service
  - the name of the `api integration` used for authentication
  - the function name, i.e. an alias for the REST API
- Snowflake can send a large number of requests; any failures are recorded in the query profile as retries due to transient failures
- remote service can return 429 to indicate Snowflake should scale back sending concurrent requests
- AWS API gateway provides caching and throttling
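The async 202/`GET`-polling exchange above can be modeled with a small in-memory endpoint. This is purely illustrative: the job store and the "done after N polls" rule are invented for the sketch and are not part of Snowflake's protocol.

```python
import itertools

class AsyncEndpoint:
    """Toy model of an async external-function remote service."""

    def __init__(self, polls_until_done=3):
        self._ids = itertools.count(1)
        self._jobs = {}                      # job id -> polls remaining
        self._polls_until_done = polls_until_done

    def post(self, payload):
        # Validate, start background work, answer immediately with 202.
        job_id = next(self._ids)
        self._jobs[job_id] = self._polls_until_done
        return 202, {"job_id": job_id}

    def get(self, job_id):
        # Snowflake polls with GET; 202 means "still working, poll again".
        remaining = self._jobs[job_id]
        if remaining > 0:
            self._jobs[job_id] = remaining - 1
            return 202, None
        # Final response: something other than 202 (here 200 with rows).
        return 200, {"data": [[0, "done"]]}
```

Snowflake keeps issuing `GET`s (with delays) until a non-202 status arrives, exactly as the workflow bullets describe.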
# Data Metric Function
- validates data in all types of tables (except hybrid tables) and views
- limited to 10,000 DMFs per account
- system DMFs include `{blank,null}_{count,percent}`, `avg`/`min`/`max`/`stddev`, `row_count`
- custom DMFs accept a table as an argument:
  - can accept multiple tables (e.g. for referential integrity checks)
- DMFs are associated with objects in a two-step process: first set a schedule, then associate a DMF with the table/view
  - set the table parameter `data_metric_schedule` for the object, e.g. `alter table hr.tables.empl_info set data_metric_schedule = '<schedule>'`
    - the schedule can be an interval, a cron expression, or a trigger (`TRIGGER_ON_CHANGES`)
  - associate DMFs with the table/view: `alter table t add data metric function snowflake.core.null_count on (c1)`
- a DMF can be called manually: `select DMF(<query>)`
- viewing results:
  - dedicated event view/table and function: `snowflake.local.data_quality_monitoring_results[_raw]()`
    - application roles `snowflake.data_quality_monitoring_{admin, viewer, lookup}` allow access to the table, view or function
    - also requires `select` on the table and `usage` on the DMF
  - identify individual records that failed data quality checks: `system$data_metric_scan`
- DMF can run incrementally, provided certain conditions are met
# FLATTEN
parameters:

- `PATH`: sub-path within the expression to be expanded
- `OUTER`: if true, at least one row is generated even if `PATH` cannot be accessed, with `NULL` values for `KEY`, `INDEX`, `VALUE`
- `MODE`: `OBJECT`, `ARRAY` or `BOTH` (default)
- `RECURSIVE`: expand all sub-elements too (default false); use the metadata column `THIS` to determine the value being exploded
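A rough Python model of FLATTEN's row output over parsed JSON can make `OUTER` and `RECURSIVE` concrete. This is a simplification, not Snowflake's implementation: `PATH` and `MODE` are omitted, and the dict keys mirror the `KEY`/`INDEX`/`VALUE`/`THIS` metadata columns.

```python
def flatten(value, outer=False, recursive=False):
    """Emit one dict per output row, mimicking FLATTEN's metadata columns."""
    rows = []
    if isinstance(value, dict):
        items = [(k, None, v) for k, v in value.items()]
    elif isinstance(value, list):
        items = [(None, i, v) for i, v in enumerate(value)]
    else:
        items = []                    # scalar: nothing to expand
    if not items:
        if outer:
            # OUTER=TRUE: still emit one row, with NULL key/index/value.
            rows.append({"key": None, "index": None,
                         "value": None, "this": value})
        return rows
    for key, index, v in items:
        # THIS is the value currently being exploded.
        rows.append({"key": key, "index": index, "value": v, "this": value})
        if recursive:
            # RECURSIVE=TRUE: also expand every sub-element.
            rows.extend(flatten(v, outer=False, recursive=True))
    return rows
```

For example, `flatten({"a": [1, 2]}, recursive=True)` yields the row for key `a` plus one row per array element, while `flatten(5)` yields no rows unless `outer=True`.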
# Stored Proc
- Can use Snowflake Scripting, JavaScript, Snowpark (Python, Scala, Java)
- anonymous procedures do not require the `create procedure` privilege on the schema
- owner's-rights stored procedure limitations:
  - cannot alter session state (e.g. session variables, current database, etc.)
  - cannot query `information_schema` table functions that return data based on `current_user()` (e.g. `automatic_clustering_history()`)
- SQL Stored-procedure can return a table
- SQL stored procedures are converted to JavaScript internally
- a transaction started in one scope must end in the same scope; in other words, a stored procedure call must either be entirely inside a transaction, or the transaction must be entirely inside the stored procedure call
- available JavaScript objects:
  - `snowflake`: available by default
  - `Statement`: provides a way to execute SQL
  - `ResultSet`: results of a query execution
  - `SfDate`: corresponds to the SQL `TIMESTAMP` data type
- a nested caller's-rights procedure runs with caller's rights iff all calls in the hierarchy are caller's-rights; an owner's-rights stored procedure always runs with owner's rights
- stored procedures can have optional arguments (parameter defaults defined with a `DEFAULT` clause)