
Architecture

Spark job

  • A Spark application consists of one driver process and multiple executor processes running on worker nodes; each executor has task slots, roughly one per core
  • SparkSession is the interface to the driver process
  • A job run is a series of stages. A stage consists of 1+ tasks; each task requires one slot to run.
    • Jobs can run in parallel, but usually don't. Stages run sequentially. Tasks run in parallel provided enough slots are available
  • A script using the DataFrame API may result in 1+ Spark jobs, whereas an RDD action results in a single job
  • 3 ways to run a Spark job
    • local: driver and executors run on the local machine, in a single JVM
    • client: driver runs locally, executors run on remote worker nodes
    • cluster: both driver and executors run on remote nodes
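Local mode can be sketched directly in code. A minimal sketch, assuming Spark is on the classpath; the app name is made up, and in client/cluster mode the master URL and deploy mode would normally be supplied via spark-submit rather than hard-coded:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the interface to the driver process.
// "local[4]" runs the driver and executors in a single local JVM with 4 slots.
val spark = SparkSession.builder()
  .appName("notes-example")   // hypothetical name
  .master("local[4]")
  .getOrCreate()

// defaultParallelism roughly reflects the number of available slots
println(spark.sparkContext.defaultParallelism)

spark.stop()
```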

DataFrame/RDD

  • DataFrame
    • higher-level, newer API
    • resides in org.apache.spark.sql.* namespace
    • uses Tungsten to optimize memory and CPU utilization at the stage level, minimizing the overhead of JVM GC
  • RDD
    • low-level APIs
    • resides in org.apache.spark.*
  • Data abstractions: DataFrame, Dataset, SQL Table and RDD (Resilient Distributed Dataset)
    • are immutable
    • DataFrames have their types checked at run time; their rows are of type Row
    • Datasets are available in Scala (via case classes) and Java (via JavaBeans)
    • DataFrames and Datasets are converted to RDDs during physical planning
  • Operations consist of either transformations or actions
  • Transformations can be narrow (doesn't require shuffle) or wide (requires shuffle)
  • Actions spawn jobs; jobs consist of stages; stages spawn tasks
    • one task runs per slot, and slots roughly correspond to cores (Tasks ≈ Slots ≈ Cores)
  • Data types
    • data can be Row objects, or can be represented via Encoders (available only as Scala case classes or JavaBeans)
  • Data locality: PROCESS_LOCAL > NODE_LOCAL > RACK_LOCAL > NO_PREF
    • affects scheduling wait times: the scheduler waits longer before falling back to a less-local level, so far-off data increases wait time
  • From the Catalyst optimizer's point of view, RDDs are schema-less, and hence can't be optimized
  • From Scala's point of view, DataFrames are untyped, whereas RDDs and Datasets are typed
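A short sketch of the narrow/wide transformation and action distinction, assuming an existing SparkSession named `spark`; the data and column names are made up:

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = (1 to 100).toDF("n")

// Narrow transformation: each output partition depends on one input partition, no shuffle
val evens = df.filter(col("n") % 2 === 0)

// Wide transformation: grouping requires a shuffle across partitions
val counts = evens.groupBy(col("n") % 10).count()

// Action: triggers a job, which is split into stages at the shuffle boundary
counts.show()
```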
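The typed/untyped distinction can be illustrated with a case-class Dataset. Again a sketch, assuming a `spark` session in scope; `Person` is a made-up case class:

```scala
case class Person(name: String, age: Int)

import spark.implicits._

// DataFrame = Dataset[Row]: columns are addressed by string name, checked at run time
val df = Seq(("Ada", 36), ("Alan", 41)).toDF("name", "age")

// Dataset[Person]: fields are checked at compile time via the case-class Encoder
val ds = Seq(Person("Ada", 36), Person("Alan", 41)).toDS()

// Compiles because the element type is Person; the equivalent on df
// would go through untyped Row access
val ages = ds.map(_.age)
```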

DataFrame Query planning

```mermaid
flowchart TD
  DataFrame --> ULP[Unresolved Logical Plan]
  SQL[SQL Query] --> ULP
  ULP --Analysis--> LP[Logical Plan]
  Catalog --Analysis--> LP
  LP --Logical Optimization--> OLP[Optimized Logical Plan]
  OLP --Physical Planning--> PP@{ shape: st-rect, label: "Physical Plans" }
  PP --> CM[Cost Model]
  CM --> SPL[Selected Physical Plan]
  SPL --Code Generation--> RDD
```
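The stages of this pipeline can be inspected with `Dataset.explain`. A sketch assuming an existing `spark` session and made-up data:

```scala
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "tag")

// extended = true prints the parsed (unresolved) logical plan, the analyzed
// logical plan, the optimized logical plan, and the selected physical plan
df.filter($"id" > 1).explain(extended = true)
```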