Skip to content

Big Data

  • types of services: BigQuery, Pub/Sub, Data Flow, Data proc, Datalab
  • Datalab is Jupyter on cloud with seamless integration with other GCP services
  • Data Loss Prevention DLP API allows PII and other information to be recognized such as credit-card, phone numbers

Cloud Composer

  • is a managed Apache Airflow service
  • In GCS, composer requires at least one environment to be created, which in turn uses Kubernetes instance
  • DAGs are stored in storage buckets, changes to which are automatically imported by Composer

Data Lake

  • Common sink for raw data, usually stored in Google Storage as blobs
  • Common source for DW (BigQuery)
  • Data load strategies:
  • EL: When source data format is readily consumable (eg BigQuery can read Avro, Parquet, JSON files)
  • ELT: When transformation required on source data is relatively cheap or easy (BigQuery can do that in SQL)
  • ETL: When transformation required on source data is complex or expensive to perform in DW

Dataproc

  • Hadoop cluster without cluster management and dynamic scaling (while jobs are running)
  • two ways to select components: Optional components and initialization actions
  • use custom images instead of startup scripts for faster startup
  • Preemptible VMs (PVM) are up to 80% discounted than regular VMs
  • good for cluster auto-scaling or setting up transient clusters
  • persistent data must be stored off-cluster (GCS, BigTable, BigQuery) using connectors
  • BP: Pick HDFS over GCS when,
  • a lot of metadata operations, ie lots of directories and partitions
  • Modify data, rename files or HDFS append operations frequently (because GCS is immutable)
  • heavy IO, e.g. a lot of partitioned writes
  • low latency is required
  • BP: read/write initial/final outputs to GCS and intermediate job outputs to HDFS
  • attach large SSDs to worker nodes to increase available local HDFS storage
  • BP: choosing between large number of nodes with lower # of CPUs or smaller number of nodes with each having high # of CPUs
  • there is limit to total number of persistent storage per VM (64TB total or 3TB (375GB * 8) SSD)
  • cluster can be automatically deleted if idle (10 min.s or up to 14 days)
  • cluster can be auto scaled
  • cluster can be provisioned with optional components such as Presto, Jupyter, Anaconda etc
  • Use workflow templates to configure, submit, monitor and delete ephemeral clusters
  • Use gcloud to submit jobs rather than native hdfs commands to enable capturing logs via stackdriver

gcloud

  • Create a cluster:
gcloud dataproc clusters create sparktodp --region us-central1 --subnet default --zone us-central1-b \
  --enable-component-gateway \  # allow access to Spark components UI
  --optional-components ANACONDA,JUPYTER \
  --master-machine-type n1-standard-4 --num-workers 2 --worker-machine-type n1-standard-4
  • get associated GCS bucket name: gcloud dataproc clusters describe sparktodp --region=us-central1 --format=json | jq -r '.config.configBucket'
  • submit a job: gcloud dataproc jobs submit pyspark --cluster sparktodp --region us-central1 spark_analysis.py -- --bucket=$1
  • sample Spark job

Data Fusion

  • UI based batch data pipelines creation, optionally using templates (eg, GCS to BQ)
  • creates long running instance but is not charged, only the pipeline execution is charged
  • instances are created on a GKE cluster as a tenant project
  • instance's service account needs to be granted Cloud Data Fusion API Service Agent role
  • Componenets:
  • Control Center: Applications (consist of 1+ pipelines), Artifacts (non-data eg data dictionary), Dataset
  • Pipelines: DAGs built using Developer Studio
  • Wrangler: Data Exploration
  • Rules Engine allows users to define data transformations and group them in Rulebook
    • integrated with data pipelines
  • Metadata Aggregator stores business, operational and technical metadata
    • allows: better governance using data lineage, produce shareable data dictionaries
    • integration with enterprise MDM and the Rules Engine
  • integration with Data Wrangler
  • provisions ephemeral execution environment (dataproc) and runs pipeline (Spark or MR)
  • optionally it can use a pre-provisioned cluster

Dataprep

  • Data Wrangling tool to interactively explore and build a Flow or a recipe of transformations to be applied.
  • Transformations:
  • change data type
  • remove unused columns
  • add derived column
  • map values using CASE statement
  • dedup rows
  • filter in/out rows based on column values
  • apply formula to columns
  • pivot/unpivot
  • When running a flow, Cloud Dataflow job is created and submitted

DataFlow

  • serverless Apache beam
  • adaptive transform-by-transform auto-scaling
  • v/s reactive auto-scaling based on cluster utilization of dataproc
  • concepts
  • PTransform are actions or code that act on data, can be input (source), transforms or output (sink)
  • PCollections are immutable data that flow from one PTransform to to another
    • PCollection stores data as serialized Elements in a byte stream, and are only deserialized when need to be acted upon
    • Bounded PCollections are made up of fixed number of Elements
    • Unbounded PCollections are streaming and don't have fixed number of Elements
  • Pipeline is a DAG collection of PTransforms
  • Pipeline Runners are the back-end systems that run actions
  • Optimized execution path
  • Automatic dynamic load balancing by allocating work from busy nodes to other nodes
  • v/s static distribution of work, and the entire cluster must wait for the slower nodes to finish
  • Auto-healing: failed nodes' work is routed to other nodes
  • Transforms:
  • ParDo: implements parallel processing
    • works on one element at a time, cannot have state
    • can accept side inputs, or produce side output
    • can be stateful by allowing @Setup and @Teardown methods, e.g. setup database connection for lookups
  • GroupByKey: creates a tuple of key and group of values
  • CoGroupByKey: joins 2 or more PCollections (K, V) and (K, W) using the same key with output (K, [V], [W])
  • Combine*: reduce or combines a PCollection by applying a function such as sum, min, max etc
    • eg. CombineGlobally(sum) or CombinePerKey(sum) applied to all values or grouped key-value pairs of PCollection respectively
    • sub-classes of CombineFn can implement custom logic, but they must be commutative and associative
    • CombineByKey is likely to be more efficient than GroupByKey because it does local grouping first which can be parallelized
  • Flatten: flattens multiple PCollections into one
  • Partition:
  • Can create pipeline templates for end-users
  • special add_value_provider_argument allows end-users to provide argument at run-time
  • Cloud Dataflow SQL pipelines can process data using SQL without requiring Python or Java knowledge
  • Q: how do you implement exactly once semantics with pubsub? A: Use dataflow pipeline
  • if the duplicate message arrives after 10 minutes, it won't be deduped
  • Q: how do you handle malformed data since dataflow will try to reprocess forever? A: Use dead-letter side-output in ParDo

Streaming

  • Unlike batch, which has one Global window, streaming has 3 windowing techniques
  • Fixed: non-overlapping intervals, e.g. hourly, daily etc
  • Sliding: can overlap, e.g. 30 sec duration window every 5 sec period. Useful for computing running aggregates
  • Session: data-dependent window, e.g. session events, can't know the window size without looking at data
  • Lag Time is the difference between expected arrival time and the actual arrive time of late arriving messages (due to various reasons)
  • Watermark is way to determine when a window ends
  • Heuristic Watermark techniques try to guess when a window should be closed and emitted
  • Processing Late Data is done by triggering.
  • data arriving later than watermark are, by default, discarded
  • Triggering: decides when aggregation are triggered
  • AfterWatermark: this is the default, data arrived within watermark are included
    • Python API discards late data, but Java allows late data API
  • AfterProcessingTime triggered after processing every interval based on system clock (v/s timestamp associated with the data)
  • AfterCount: triggers after certain number of elements have been processed
  • Composite trigger allow combining above triggers
  • AccumulationMode can be either
  • accumulate data within current window is accumulated and emitted during each trigger
  • discarding only data since the last trigger is emitted

Pub/Sub

  • Pub/Sub has at least once delivery guarantee => message may be delivered multiple times
  • Pub/Sub supports push/pull deliver: i.e. it can notify subscribers when new message arrives
  • Saves events up to 7 days
  • Pull and Push subscriptions
  • Pull can be async or sync
  • streaming pull is the most efficient, and push is the least efficient
  • Order of messages isn't guaranteed
  • topics are named as projects/<project-id>/topics/<topic>
  • Ensure user/service account has correct permission on the topic (publisher/subscriber etc)
  • seek allows replaying messages already acknowledged, two ways to do that
  • snapshot: messages received before the snapshot are restored and messages received after are new
  • timestamp: messages received before a specific time are considered read
    • timestamp can be moved forward to discard unprocessed messages

Airflow

  • concepts
  • Operator: A description of a single task. E.g. {Bash,Python,Jdbc,Oracle,MySql,S3FileTransfer}Operator
  • Task: Parameterized Operator
  • Task Instance: A specific run of a Task
  • DAG: Collections of Tasks with their dependencies
  • Executor: runs a task. Concurrent tasks are limited by number of available executors
  • dependencies: task1 >> task2, task1 is upstream to task2
  • XComs: mechanism to let tasks exchange messages
  • Variable: key-value store to share settings, values etc