
Architecture

Spark job

  • A Spark application consists of one driver process and multiple executor processes running on worker nodes; each executor has task slots, roughly one per core
  • SparkSession is the interface to the driver process
  • A job run is a series of stages. A stage consists of 1+ tasks; each task requires one slot to run.
    • Jobs can run in parallel, but usually don't. Stages run sequentially. Tasks run in parallel provided enough slots are available
  • A script using the DataFrame API may result in 1+ Spark jobs, whereas an RDD action results in a single job
  • 3 ways to run a Spark job
    • local: driver and executors run on the local machine, in a single JVM
    • client: driver runs locally, executors run on remote worker nodes
    • cluster: both driver and executors run on remote nodes
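Local mode can be sketched directly in code. A minimal sketch, assuming Spark is on the classpath; the app name is made up, and in client/cluster mode the master URL and deploy mode would normally be supplied via spark-submit rather than hard-coded:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the interface to the driver process.
// "local[4]" runs the driver and executors in a single local JVM with 4 slots.
val spark = SparkSession.builder()
  .appName("notes-example")   // hypothetical name
  .master("local[4]")
  .getOrCreate()

// defaultParallelism roughly reflects the number of available slots
println(spark.sparkContext.defaultParallelism)

spark.stop()
```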

DataFrame/RDD

  • DataFrame
    • higher-level, newer API
    • resides in org.apache.spark.sql.* namespace
    • uses Tungsten to optimize memory and CPU utilization at the stage level, minimizing the overhead of JVM GC
  • RDD
    • low-level APIs
    • resides in org.apache.spark.*
  • Data abstractions: DataFrame, Dataset, SQL Table and RDD (Resilient Distributed Dataset)
    • are immutable
    • DataFrames have their types checked at run time; their rows are of type Row
    • Datasets are available in Scala (via case classes) and Java (via JavaBeans)
    • DataFrames and Datasets are converted to RDDs during physical planning
  • Operations consist of either transformations or actions
  • Transformations can be narrow (doesn't require shuffle) or wide (requires shuffle)
  • Actions spawn jobs; jobs consist of stages; stages spawn tasks
    • one task runs per slot, and slots roughly correspond to cores (Tasks ≈ Slots ≈ Cores)
  • Data types
    • data can be Row objects, or can be represented via Encoders (available only as Scala case classes or JavaBeans)
  • Data locality: PROCESS_LOCAL > NODE_LOCAL > RACK_LOCAL > NO_PREF
    • affects scheduling wait times: the scheduler waits longer before falling back to a less-local level, so far-off data increases wait time
  • From the Catalyst optimizer's point of view, RDDs are schema-less, and hence can't be optimized
  • From Scala's point of view, DataFrames are untyped, whereas RDDs and Datasets are typed
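A short sketch of the narrow/wide transformation and action distinction, assuming an existing SparkSession named `spark`; the data and column names are made up:

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = (1 to 100).toDF("n")

// Narrow transformation: each output partition depends on one input partition, no shuffle
val evens = df.filter(col("n") % 2 === 0)

// Wide transformation: grouping requires a shuffle across partitions
val counts = evens.groupBy(col("n") % 10).count()

// Action: triggers a job, which is split into stages at the shuffle boundary
counts.show()
```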
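The typed/untyped distinction can be illustrated with a case-class Dataset. Again a sketch, assuming a `spark` session in scope; `Person` is a made-up case class:

```scala
case class Person(name: String, age: Int)

import spark.implicits._

// DataFrame = Dataset[Row]: columns are addressed by string name, checked at run time
val df = Seq(("Ada", 36), ("Alan", 41)).toDF("name", "age")

// Dataset[Person]: fields are checked at compile time via the case-class Encoder
val ds = Seq(Person("Ada", 36), Person("Alan", 41)).toDS()

// Compiles because the element type is Person; the equivalent on df
// would go through untyped Row access
val ages = ds.map(_.age)
```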

DataFrame Query planning

```mermaid
flowchart TD
  DataFrame --> ULP[Unresolved Logical Plan]
  SQL[SQL Query] --> ULP
  ULP --Analysis--> LP[Logical Plan]
  Catalog --Analysis--> LP
  LP --Logical Optimization--> OLP[Optimized Logical Plan]
  OLP --Physical Planning--> PP@{ shape: st-rect, label: "Physical Plans" }
  PP --> CM[Cost Model]
  CM --> SPL[Selected Physical Plan]
  SPL --Code Generation--> RDD
```
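The stages of this pipeline can be inspected with `Dataset.explain`. A sketch assuming an existing `spark` session and made-up data:

```scala
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "tag")

// extended = true prints the parsed (unresolved) logical plan, the analyzed
// logical plan, the optimized logical plan, and the selected physical plan
df.filter($"id" > 1).explain(extended = true)
```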