Architecture
Spark application
- An application consists of one driver process and multiple executor processes running on worker nodes; each executor has slots, roughly one per core
- SparkSession is the interface to the driver process
- A job is a series of stages. A stage consists of 1+ tasks, and each task requires one slot to run
- Jobs can run in parallel, but usually don't. Stages run sequentially. Tasks run in parallel, provided enough slots are available
- A script using the DataFrame API may result in 1+ Spark jobs per action, whereas an RDD action results in a single job
- 3 ways to run a Spark application
  - local: both driver and executors run locally, in a single JVM
  - client: driver runs on the submitting machine, executors run on remote nodes
  - cluster: both driver and executors run on remote nodes
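For reference, the mode is chosen via `spark-submit`'s `--master` and `--deploy-mode` flags. A sketch — the master URL, class name, and jar are placeholders:

```shell
# local mode: driver and executors in one JVM; [4] = 4 worker threads
spark-submit --master "local[4]" --class com.example.Main app.jar

# client mode: driver on the submitting machine, executors on the cluster
spark-submit --master spark://host:7077 --deploy-mode client --class com.example.Main app.jar

# cluster mode: driver and executors both on cluster nodes
spark-submit --master spark://host:7077 --deploy-mode cluster --class com.example.Main app.jar
```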
DataFrame/RDD
- DataFrame
  - higher-level, newer API
  - resides in the `org.apache.spark.sql.*` namespace
  - uses Tungsten to optimize memory and CPU utilization at the stage level, minimizing the overhead of JVM GC
- RDD
  - low-level API
  - resides in `org.apache.spark.*`
- Data abstractions: DataFrame, Dataset, SQL Table and RDD (Resilient Distributed Dataset)
  - all are immutable
  - DataFrames have their types checked at run time and hold data as `Row` objects
  - Datasets are available in Scala as case classes or in Java with JavaBeans
  - DataFrames and Datasets are converted to RDDs during physical planning
- Operations are either transformations or actions
  - Transformations can be narrow (no shuffle required) or wide (shuffle required)
  - Actions spawn jobs; jobs consist of stages; stages spawn tasks
- Data Types
  - Data can be `Row` objects or represented via Encoders (available only for Scala case classes or JavaBeans)
- Data locality: PROCESS_LOCAL > NODE_LOCAL > RACK_LOCAL > NO_PREF
  - affects scheduling wait times: the scheduler waits before falling back to a less-local level, so far-off data increases wait times
- From the Catalyst optimizer's point of view, RDDs are schema-less and hence can't be optimized
- From Scala's point of view, DataFrame is untyped, whereas RDD and Dataset are typed
DataFrame Query planning

```mermaid
flowchart TD
    DataFrame --> ULP[Unresolved Logical Plan]
    SQL[SQL Query] --> ULP
    ULP --Analysis--> LP[Logical Plan]
    Catalog --Analysis--> LP
    LP --Logical Optimization--> OLP[Optimized Logical Plan]
    OLP --Physical Planning--> PP@{ shape: st-rect, label: "Physical Plans" }
    PP --> CM[Cost Model]
    CM --> SPL[Selected Physical Plan]
    SPL --Code Generation--> RDD
```