Skip to content

Snowpark

  • DataFrame transformations are translated to SQL and run on all nodes. df.collect() or df.to_pandas() run on a single node
  • Multi-threading
    • Can't fork a new process, but multi-threading is allowed
    • For Java UD(T)F each worker node runs an instance of JVM (Ref), but there may be more than one Python interpreter (Ref)
  • Logging, Java create a static instance sl4j of logger; for python use logging.getLogger()

Create UD(T)F and SP

  • UDF can be either anonymous or named. For named UDF, supply name=<udf_name> parameter value
    • named UDF are accessible in the same session
  • define using decorators udf, udtf, udaf or sproc
  • for permanent UDF, use is_permanent = True (default False) and provide value for stage_location; temporary UDFs use Session.get_session_stage()
  • UD(T)F registration can supply imports and packages. These values will be used to override session level values when executing functions
  • define by:
    1. add package dependencies, either
      • for all UDFs using session.add_packages
      • or, specific for each UDF using @udf(packages=...) decorator (overrides any packages added by session.add_package)
      • snowpark library is not automatically uploaded
    2. optionally, add user code/data files using session.add_import
      • referencing local files are allowed (automatically uploaded as part of execution)
  • examples
    import pandas as pd
    import snowflake.snowpark
    import xgboost as xgb
    from snowflake.snowpark.functions import sproc
    
    @sproc(packages=["snowflake-snowpark-python", "pandas", "xgboost==1.5.0"])
    def compute(session: snowflake.snowpark.Session) -> list:
        return [pd.__version__, xgb.__version__]
    
    # register a permanent, named stored-proc
    @sproc(name="minus_one", is_permanent=True, stage_location="@my_stage", replace=True, packages=["snowflake-snowpark-python"])
    def minus_one(session: snowflake.snowpark.Session, x: int) -> int:
        return session.sql(f"select {x} - 1").collect()[0][0]