
Big Data

  • types of services: BigQuery, Pub/Sub, Dataflow, Dataproc, Datalab
  • Datalab is hosted Jupyter on cloud with seamless integration with other GCP services
  • Data Loss Prevention (DLP) API recognizes PII and other sensitive data such as credit card and phone numbers

Cloud Composer

  • is a managed Apache Airflow service
  • In GCP, Composer requires at least one environment to be created, which in turn runs on a GKE cluster
  • DAGs are stored in storage buckets, changes to which are automatically imported by Composer

Data Lake

  • Common sink for raw data, usually stored in Cloud Storage (GCS) as blobs
  • Common source for DW (BigQuery)
  • Data load strategies:
    • EL: When source data format is readily consumable (eg BigQuery can read Avro, Parquet, JSON files)
    • ELT: When transformation required on source data is relatively cheap or easy (BigQuery can do that in SQL)
    • ETL: When transformation required on source data is complex or expensive to perform in DW
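The EL path can be as simple as pointing a BigQuery load job at files already in the data lake. A minimal load-job configuration sketch using the `google-cloud-bigquery` Python client; the bucket, dataset, and table names are hypothetical placeholders, and running it requires GCP credentials:

```python
from google.cloud import bigquery

client = bigquery.Client()

# EL: no transformation needed; BigQuery reads Avro directly from GCS.
# "my-bucket" and "my_dataset.events" are illustrative placeholders.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/events/*.avro",  # data-lake files in GCS
    "my_dataset.events",                 # DW destination table
    job_config=job_config,
)
load_job.result()  # block until the load completes
```

For ELT, the same load would be followed by a SQL transformation run inside BigQuery; for ETL, the transformation happens before the load (e.g. in Dataflow).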

Dataproc

  • managed Hadoop/Spark clusters, minus the cluster-management overhead, with dynamic scaling (even while jobs are running)
  • two ways to select components: Optional components and initialization actions
  • use custom images instead of startup scripts for faster startup
  • Preemptible VMs (PVMs) are up to 80% cheaper than regular VMs
    • good for cluster auto-scaling or setting up transient clusters
    • persistent data must be stored off-cluster (GCS, BigTable, BigQuery) using connectors
  • BP: pick HDFS over GCS when:
    • there are a lot of metadata operations, ie lots of directories and partitions
    • you modify data, rename files, or use HDFS append operations frequently (because GCS objects are immutable)
    • IO is heavy, e.g. a lot of partitioned writes
    • low latency is required
  • BP: read/write initial/final outputs to GCS and intermediate job outputs to HDFS
    • attach large SSDs to worker nodes to increase available local HDFS storage
  • BP: when choosing between many nodes with few CPUs each v/s fewer nodes with many CPUs each,
    • note there is a limit on total persistent storage per VM (64TB total, or 3TB (375GB * 8) local SSD)
  • cluster can be automatically deleted when idle (timeout configurable from 10 minutes up to 14 days)
  • cluster can be auto scaled
  • cluster can be provisioned with optional components such as Presto, Jupyter, Anaconda etc
  • Use workflow templates to configure, submit, monitor and delete ephemeral clusters
  • Use gcloud to submit jobs rather than native Hadoop/hdfs commands, so that logs are captured via Stackdriver

gcloud

  • Create a cluster:
    # --enable-component-gateway allows access to the Spark/Hadoop component web UIs
    gcloud dataproc clusters create sparktodp --region us-central1 --subnet default --zone us-central1-b \
        --enable-component-gateway \
        --optional-components ANACONDA,JUPYTER \
        --master-machine-type n1-standard-4 --num-workers 2 --worker-machine-type n1-standard-4
    
  • get associated GCS bucket name: gcloud dataproc clusters describe sparktodp --region=us-central1 --format=json | jq -r '.config.configBucket'
  • submit a job: gcloud dataproc jobs submit pyspark --cluster sparktodp --region us-central1 spark_analysis.py -- --bucket=$1
  • sample Spark job

Data Fusion

  • UI-based batch data pipeline creation, optionally using templates (eg, GCS to BQ)
  • creates a long-running instance, but only pipeline execution is charged, not the instance itself
    • instances are created on a GKE cluster in a tenant project
    • the instance's service account needs to be granted the Cloud Data Fusion API Service Agent role
  • Components:
    • Control Center: Applications (consist of 1+ pipelines), Artifacts (non-data eg data dictionary), Dataset
    • Pipelines: DAGs built using Developer Studio
    • Wrangler: Data Exploration
    • Rules Engine allows users to define data transformations and group them in Rulebook
      • integrated with data pipelines
    • Metadata Aggregator stores business, operational and technical metadata
      • allows: better governance using data lineage, produce shareable data dictionaries
      • integration with enterprise MDM and the Rules Engine
    • integration with Data Wrangler
  • provisions ephemeral execution environment (dataproc) and runs pipeline (Spark or MR)
    • optionally it can use a pre-provisioned cluster

Dataprep

  • Data Wrangling tool to interactively explore and build a Flow or a recipe of transformations to be applied.
  • Transformations:
    • change data type
    • remove unused columns
    • add derived column
    • map values using CASE statement
    • dedup rows
    • filter in/out rows based on column values
    • apply formula to columns
    • pivot/unpivot
  • When running a flow, a Cloud Dataflow job is created and submitted

Dataflow

  • serverless Apache Beam
  • adaptive transform-by-transform auto-scaling
    • v/s Dataproc's reactive auto-scaling based on cluster utilization
  • concepts
    • PTransforms are actions or code that act on data; can be an input (source), a transform, or an output (sink)
    • PCollections are immutable data that flow from one PTransform to another
      • a PCollection stores data as serialized Elements in a byte stream, deserialized only when they need to be acted upon
      • Bounded PCollections are made up of a fixed number of Elements
      • Unbounded PCollections are streaming and have no fixed number of Elements
    • a Pipeline is a DAG of PTransforms
    • Pipeline Runners are the back-end systems that execute pipelines
  • Optimized execution path
  • Automatic dynamic load balancing by allocating work from busy nodes to other nodes
    • v/s static distribution of work, and the entire cluster must wait for the slower nodes to finish
  • Auto-healing: failed nodes' work is routed to other nodes
  • Transforms:
    • ParDo: implements parallel processing
      • works on one element at a time and is stateless by default
      • can accept side inputs, or produce side outputs
      • @Setup and @Teardown methods allow per-instance setup, e.g. opening a database connection for lookups
    • GroupByKey: creates a tuple of key and group of values
    • CoGroupByKey: joins 2 or more PCollections (K, V) and (K, W) using the same key with output (K, [V], [W])
    • Combine*: reduces or combines a PCollection by applying a function such as sum, min, max etc
      • eg. CombineGlobally(sum) or CombinePerKey(sum) applied to all values or grouped key-value pairs of PCollection respectively
      • sub-classes of CombineFn can implement custom logic, but they must be commutative and associative
      • CombinePerKey is likely to be more efficient than GroupByKey because it does local combining first, which can be parallelized
    • Flatten: flattens multiple PCollections into one
    • Partition: splits one PCollection into a fixed number of smaller PCollections based on a partitioning function
  • Can create pipeline templates for end-users
    • the special add_value_provider_argument allows end-users to provide arguments at run-time
  • Cloud Dataflow SQL pipelines can process data using SQL without requiring Python or Java knowledge
  • Q: how do you implement exactly-once semantics with Pub/Sub? A: use a Dataflow pipeline, which dedupes by message ID
    • if a duplicate message arrives after 10 minutes, it won't be deduped
  • Q: how do you handle malformed data, since Dataflow will try to reprocess it forever? A: use a dead-letter side output in ParDo
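The CoGroupByKey and CombinePerKey shapes above can be illustrated in plain Python (semantics only; this is not Beam code, and the sample data is made up):

```python
from collections import defaultdict

def co_group_by_key(pc_v, pc_w):
    """CoGroupByKey semantics: (K, V) and (K, W) pairs -> (K, [V], [W])."""
    vs, ws = defaultdict(list), defaultdict(list)
    for k, v in pc_v:
        vs[k].append(v)
    for k, w in pc_w:
        ws[k].append(w)
    # every key from either side appears in the output
    return {k: (vs[k], ws[k]) for k in set(vs) | set(ws)}

def combine_per_key(pc, fn):
    """CombinePerKey semantics: reduce all values sharing a key with fn."""
    grouped = defaultdict(list)
    for k, v in pc:
        grouped[k].append(v)
    return {k: fn(vals) for k, vals in grouped.items()}

emails = [("amy", "amy@x.com"), ("carl", "carl@x.com")]
phones = [("amy", "555-1234"), ("amy", "555-5678")]
print(co_group_by_key(emails, phones)["amy"])  # (['amy@x.com'], ['555-1234', '555-5678'])
print(combine_per_key(phones, len))            # {'amy': 2}
```

In real Beam, CombinePerKey additionally pre-combines values locally on each worker before the shuffle, which is why it beats GroupByKey plus a manual reduction.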

Streaming

  • Unlike batch, which has one Global window, streaming has 3 windowing techniques
    • Fixed: non-overlapping intervals, e.g. hourly, daily etc
    • Sliding: can overlap, e.g. 30 sec duration window every 5 sec period. Useful for computing running aggregates
    • Session: data-dependent window, e.g. session events, can't know the window size without looking at data
  • Lag Time is the difference between the expected arrival time and the actual arrival time of late-arriving messages (due to various reasons)
  • Watermark is way to determine when a window ends
    • Heuristic Watermark techniques try to guess when a window should be closed and emitted
  • Processing late data is done via triggering.
    • data arriving after the watermark is, by default, discarded
  • Triggering: decides when aggregations are emitted
    • AfterWatermark: the default; data arriving within the watermark is included
      • the Python API discards late data, but Java provides an API for allowing late data
    • AfterProcessingTime: triggers after each processing-time interval based on the system clock (v/s the timestamp associated with the data)
    • AfterCount: triggers after a certain number of elements have been processed
    • Composite triggers allow combining the above
  • AccumulationMode can be either
    • accumulating: data within the current window is accumulated and re-emitted on each trigger
    • discarding: only data since the last trigger is emitted
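The fixed and sliding window assignments described above can be sketched in plain Python (illustrative only, not Beam's implementation; times are in seconds):

```python
def fixed_window(ts, size):
    """Assign a timestamp to its single non-overlapping window [start, start + size)."""
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts, size, period):
    """Return every overlapping window of length `size`, starting every
    `period` seconds, that contains the timestamp (size/period windows)."""
    last_start = ts - (ts % period)   # last window starting at or before ts
    wins = []
    start = last_start
    while start > ts - size:          # walk back while ts is still inside
        wins.append((start, start + size))
        start -= period
    return sorted(wins)

print(fixed_window(3607, 3600))                # hourly window -> (3600, 7200)
print(sliding_windows(12, size=30, period=5))  # 6 windows, each containing t=12
```

Session windows can't be computed this way: the gap-based window boundaries depend on the data itself, as the notes say.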

Pub/Sub

  • Pub/Sub has an at-least-once delivery guarantee => a message may be delivered multiple times
  • Pub/Sub supports push and pull delivery: with push, it notifies subscribers when a new message arrives
  • retains messages for up to 7 days
  • Pull and Push subscriptions
    • Pull can be async or sync
  • streaming pull is the most efficient, and push is the least efficient
  • Order of messages isn't guaranteed
  • topics are named as projects/<project-id>/topics/<topic>
  • Ensure user/service account has correct permission on the topic (publisher/subscriber etc)
  • seek allows replaying messages already acknowledged, two ways to do that
    • snapshot: messages received before the snapshot are restored and messages received after are new
    • timestamp: messages received before a specific time are considered read
      • timestamp can be moved forward to discard unprocessed messages
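Because delivery is at-least-once, subscribers should process idempotently. A minimal sketch of dedup-by-message-ID with a bounded time window (the class and the 10-minute TTL are illustrative, not a Pub/Sub API; this mirrors what Dataflow does for you):

```python
import time

class Deduper:
    """Drop messages whose ID was already seen within the last `ttl` seconds."""
    def __init__(self, ttl=600.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.seen = {}  # message_id -> time last seen

    def accept(self, message_id):
        now = self.clock()
        # evict expired entries so memory stays bounded
        self.seen = {m: t for m, t in self.seen.items() if now - t < self.ttl}
        if message_id in self.seen:
            return False              # duplicate within the window: skip
        self.seen[message_id] = now
        return True                   # first sighting: process it

dedup = Deduper(ttl=600)
print(dedup.accept("msg-1"))  # True  -> process
print(dedup.accept("msg-1"))  # False -> duplicate, skip
```

Note the same limitation as Dataflow's dedup: a duplicate arriving after the TTL expires is treated as new.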

Airflow

  • concepts
    • Operator: A description of a single task. E.g. {Bash,Python,Jdbc,Oracle,MySql,S3FileTransfer}Operator
    • Task: Parameterized Operator
    • Task Instance: A specific run of a Task
    • DAG: Collections of Tasks with their dependencies
    • Executor: runs a task. Concurrent tasks are limited by number of available executors
    • dependencies: task1 >> task2 means task1 is upstream of task2
    • XComs: mechanism to let tasks exchange messages
    • Variable: key-value store to share settings, values etc
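The `task1 >> task2` syntax works through Python operator overloading. A toy sketch of the mechanism (not Airflow's actual implementation; the `Task` class here is a made-up stand-in):

```python
class Task:
    """Minimal stand-in for an Airflow task, to show how >> wires dependencies."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def __rshift__(self, other):
        # self >> other: self is upstream of other
        self.downstream.add(other)
        other.upstream.add(self)
        return other  # returning `other` makes chains like t1 >> t2 >> t3 work

extract, load = Task("extract"), Task("load")
extract >> load
print(load.upstream == {extract})  # True
```

The scheduler then walks these upstream/downstream sets to run tasks in DAG order.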