Skip to content

Architecture

Spark job

  • A job Consists of one driver process and multiple executor processes running in worker nodes, each executor has slots which are roughly number of cores
  • SparkSession is interface to the driver process
  • A job run is a series of stages. A stage consists of 1+ tasks, each task requires one slot to run.
  • Jobs can, but usually not, run in parallel. Stages run sequentially. Tasks run in parallel provided enough slots are available
  • A script using DataFrame API may result in 1+ Spark jobs, whereas using RDD results in a single job
  • 3 ways to run a Spark job
  • local: both drivers and workers run locally
  • client: driver runs locally and workers run on remote nodes
  • cluster: both drivers and works run on remote nodes

DataFrame/RDD

  • DataFrame
  • higher-level, newer API
  • resides in org.apache.spark.sql.* namespace
  • uses tungsten to optimize memory and CPU utilization at stage level to minimize overhead of JVM GC
  • RDD
  • low-level APIs
  • resides in org.apache.spark.*
  • Data abstractions: DataFrame, Dataset, SQL Table and RDD (Resilient Distributed Data)
  • are immutable
  • DataFrame have types checked at run-time and are of type Row
  • DataSet available in Scala as case class or with JavaBeans
  • DataFrame and DataSet are converted to RDD during physical planning
  • Operations consist of either transformations or actions
  • Transformations can be narrow (doesn't require shuffle) or wide (requires shuffle)
  • Actions, spawn jobs, jobs consists of stages, stages spawn tasks
  • Tasks = Slots = Cores
  • Data Types
  • Data can be Row object or can be represented by Encoders (available only as Scala case class or Java beans)
  • Data locality: PROCESS_LOCAL > NODE_LOCAL > RACK_LOCAL > NO_PREF
  • affects wait times depending on how local the data is. Increase wait times for far-off data
  • From Catalyst optimizer pov, RDDs are schema-less, and hence can't be optimized
  • From Scala's pov, DataFrame is untyped, whereas RDD and DataSet are typed

DataFrame Query planning

flowchart TD
  DataFrame --> ULP[Unresolved Logical Plan]
  SQL[SQL Query] --> ULP
  ULP --Ananlysis--> LP[Logical Plan]
  Catalog --Ananlysis--> LP
  LP --Logical Optimization--> OLP[Optimized Logical Plan]
  OLP --Physical Planning--> PP@{ shape: st-rect, label: Physical Plans }
  PP --> CM[Cost Model]
  CM --> SPL[Selected Physical Plan]
  SPL --Code Generation--> RDD