
Big Data

  • types of services: BigQuery, Pub/Sub, Dataflow, Dataproc, Datalab
  • Datalab is hosted Jupyter on cloud with seamless integration with other GCP services
  • Data Loss Prevention (DLP) API recognizes PII and other sensitive data such as credit card and phone numbers

Cloud Composer

  • is a managed Apache Airflow service
  • In GCP, Composer requires at least one environment to be created, which in turn runs on a GKE cluster
  • DAGs are stored in storage buckets, changes to which are automatically imported by Composer

Data Lake

  • Common sink for raw data, usually stored in Cloud Storage (GCS) as blobs
  • Common source for DW (BigQuery)
  • Data load strategies:
    • EL: When source data format is readily consumable (eg BigQuery can read Avro, Parquet, JSON files)
    • ELT: When transformation required on source data is relatively cheap or easy (BigQuery can do that in SQL)
    • ETL: When transformation required on source data is complex or expensive to perform in DW
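The EL path can be as simple as pointing a BigQuery load job at files already in the data lake. A minimal load-job configuration sketch using the `google-cloud-bigquery` Python client; the bucket, dataset, and table names are hypothetical placeholders, and running it requires GCP credentials:

```python
from google.cloud import bigquery

client = bigquery.Client()

# EL: no transformation needed; BigQuery reads Avro directly from GCS.
# "my-bucket" and "my_dataset.events" are illustrative placeholders.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/events/*.avro",  # data-lake files in GCS
    "my_dataset.events",                 # DW destination table
    job_config=job_config,
)
load_job.result()  # block until the load completes
```

For ELT, the same load would be followed by a SQL transformation run inside BigQuery; for ETL, the transformation happens before the load (e.g. in Dataflow).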

Dataproc

  • managed Hadoop/Spark clusters, minus the cluster-management overhead, with dynamic scaling (even while jobs are running)
  • two ways to select components: Optional components and initialization actions
  • use custom images instead of startup scripts for faster startup
  • Preemptible VMs (PVMs) are up to 80% cheaper than regular VMs
    • good for cluster auto-scaling or setting up transient clusters
    • persistent data must be stored off-cluster (GCS, BigTable, BigQuery) using connectors
  • BP: pick HDFS over GCS when:
    • there are a lot of metadata operations, ie lots of directories and partitions
    • you modify data, rename files, or use HDFS append operations frequently (because GCS objects are immutable)
    • IO is heavy, e.g. a lot of partitioned writes
    • low latency is required
  • BP: read/write initial/final outputs to GCS and intermediate job outputs to HDFS
    • attach large SSDs to worker nodes to increase available local HDFS storage
  • BP: when choosing between many nodes with few CPUs each v/s fewer nodes with many CPUs each,
    • note there is a limit on total persistent storage per VM (64TB total, or 3TB (375GB * 8) local SSD)
  • cluster can be automatically deleted when idle (timeout configurable from 10 minutes up to 14 days)
  • cluster can be auto scaled
  • cluster can be provisioned with optional components such as Presto, Jupyter, Anaconda etc
  • Use workflow templates to configure, submit, monitor and delete ephemeral clusters
  • Use gcloud to submit jobs rather than native Hadoop/hdfs commands, so that logs are captured via Stackdriver

gcloud

  • Create a cluster:
    # --enable-component-gateway allows access to the Spark/Hadoop component web UIs
    gcloud dataproc clusters create sparktodp --region us-central1 --subnet default --zone us-central1-b \
        --enable-component-gateway \
        --optional-components ANACONDA,JUPYTER \
        --master-machine-type n1-standard-4 --num-workers 2 --worker-machine-type n1-standard-4
    
  • get associated GCS bucket name: gcloud dataproc clusters describe sparktodp --region=us-central1 --format=json | jq -r '.config.configBucket'
  • submit a job: gcloud dataproc jobs submit pyspark --cluster sparktodp --region us-central1 spark_analysis.py -- --bucket=$1
  • sample Spark job

Data Fusion

  • UI-based batch data pipeline creation, optionally using templates (eg, GCS to BQ)
  • creates a long-running instance, but only pipeline execution is charged, not the instance itself
    • instances are created on a GKE cluster in a tenant project
    • the instance's service account needs to be granted the Cloud Data Fusion API Service Agent role
  • Components:
    • Control Center: Applications (consist of 1+ pipelines), Artifacts (non-data eg data dictionary), Dataset
    • Pipelines: DAGs built using Developer Studio
    • Wrangler: Data Exploration
    • Rules Engine allows users to define data transformations and group them in Rulebook
      • integrated with data pipelines
    • Metadata Aggregator stores business, operational and technical metadata
      • allows: better governance using data lineage, produce shareable data dictionaries
      • integration with enterprise MDM and the Rules Engine
    • integration with Data Wrangler
  • provisions ephemeral execution environment (dataproc) and runs pipeline (Spark or MR)
    • optionally it can use a pre-provisioned cluster

Dataprep

  • Data Wrangling tool to interactively explore and build a Flow or a recipe of transformations to be applied.
  • Transformations:
    • change data type
    • remove unused columns
    • add derived column
    • map values using CASE statement
    • dedup rows
    • filter in/out rows based on column values
    • apply formula to columns
    • pivot/unpivot
  • When running a flow, a Cloud Dataflow job is created and submitted

Dataflow

  • serverless Apache Beam
  • adaptive transform-by-transform auto-scaling
    • v/s Dataproc's reactive auto-scaling based on cluster utilization
  • concepts
    • PTransforms are actions or code that act on data; can be an input (source), a transform, or an output (sink)
    • PCollections are immutable data that flow from one PTransform to another
      • a PCollection stores data as serialized Elements in a byte stream, deserialized only when they need to be acted upon
      • Bounded PCollections are made up of a fixed number of Elements
      • Unbounded PCollections are streaming and have no fixed number of Elements
    • a Pipeline is a DAG of PTransforms
    • Pipeline Runners are the back-end systems that execute pipelines
  • Optimized execution path
  • Automatic dynamic load balancing by allocating work from busy nodes to other nodes
    • v/s static distribution of work, and the entire cluster must wait for the slower nodes to finish
  • Auto-healing: failed nodes' work is routed to other nodes
  • Transforms:
    • ParDo: implements parallel processing
      • works on one element at a time and is stateless by default
      • can accept side inputs, or produce side outputs
      • @Setup and @Teardown methods allow per-instance setup, e.g. opening a database connection for lookups
    • GroupByKey: creates a tuple of key and group of values
    • CoGroupByKey: joins 2 or more PCollections (K, V) and (K, W) using the same key with output (K, [V], [W])
    • Combine*: reduces or combines a PCollection by applying a function such as sum, min, max etc
      • eg. CombineGlobally(sum) or CombinePerKey(sum) applied to all values or grouped key-value pairs of PCollection respectively
      • sub-classes of CombineFn can implement custom logic, but they must be commutative and associative
      • CombinePerKey is likely to be more efficient than GroupByKey because it does local combining first, which can be parallelized
    • Flatten: flattens multiple PCollections into one
    • Partition: splits one PCollection into a fixed number of smaller PCollections based on a partitioning function
  • Can create pipeline templates for end-users
    • the special add_value_provider_argument allows end-users to provide arguments at run-time
  • Cloud Dataflow SQL pipelines can process data using SQL without requiring Python or Java knowledge
  • Q: how do you implement exactly-once semantics with Pub/Sub? A: use a Dataflow pipeline, which dedupes by message ID
    • if a duplicate message arrives after 10 minutes, it won't be deduped
  • Q: how do you handle malformed data, since Dataflow will try to reprocess it forever? A: use a dead-letter side output in ParDo
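The CoGroupByKey and CombinePerKey shapes above can be illustrated in plain Python (semantics only; this is not Beam code, and the sample data is made up):

```python
from collections import defaultdict

def co_group_by_key(pc_v, pc_w):
    """CoGroupByKey semantics: (K, V) and (K, W) pairs -> (K, [V], [W])."""
    vs, ws = defaultdict(list), defaultdict(list)
    for k, v in pc_v:
        vs[k].append(v)
    for k, w in pc_w:
        ws[k].append(w)
    # every key from either side appears in the output
    return {k: (vs[k], ws[k]) for k in set(vs) | set(ws)}

def combine_per_key(pc, fn):
    """CombinePerKey semantics: reduce all values sharing a key with fn."""
    grouped = defaultdict(list)
    for k, v in pc:
        grouped[k].append(v)
    return {k: fn(vals) for k, vals in grouped.items()}

emails = [("amy", "amy@x.com"), ("carl", "carl@x.com")]
phones = [("amy", "555-1234"), ("amy", "555-5678")]
print(co_group_by_key(emails, phones)["amy"])  # (['amy@x.com'], ['555-1234', '555-5678'])
print(combine_per_key(phones, len))            # {'amy': 2}
```

In real Beam, CombinePerKey additionally pre-combines values locally on each worker before the shuffle, which is why it beats GroupByKey plus a manual reduction.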

Streaming

  • Unlike batch, which has one Global window, streaming has 3 windowing techniques
    • Fixed: non-overlapping intervals, e.g. hourly, daily etc
    • Sliding: can overlap, e.g. 30 sec duration window every 5 sec period. Useful for computing running aggregates
    • Session: data-dependent window, e.g. session events, can't know the window size without looking at data
  • Lag Time is the difference between the expected arrival time and the actual arrival time of late-arriving messages (due to various reasons)
  • Watermark is way to determine when a window ends
    • Heuristic Watermark techniques try to guess when a window should be closed and emitted
  • Processing late data is done via triggering.
    • data arriving after the watermark is, by default, discarded
  • Triggering: decides when aggregations are emitted
    • AfterWatermark: the default; data arriving within the watermark is included
      • the Python API discards late data, but Java provides an API for allowing late data
    • AfterProcessingTime: triggers after each processing-time interval based on the system clock (v/s the timestamp associated with the data)
    • AfterCount: triggers after a certain number of elements have been processed
    • Composite triggers allow combining the above
  • AccumulationMode can be either
    • accumulating: data within the current window is accumulated and re-emitted on each trigger
    • discarding: only data since the last trigger is emitted
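The fixed and sliding window assignments described above can be sketched in plain Python (illustrative only, not Beam's implementation; times are in seconds):

```python
def fixed_window(ts, size):
    """Assign a timestamp to its single non-overlapping window [start, start + size)."""
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts, size, period):
    """Return every overlapping window of length `size`, starting every
    `period` seconds, that contains the timestamp (size/period windows)."""
    last_start = ts - (ts % period)   # last window starting at or before ts
    wins = []
    start = last_start
    while start > ts - size:          # walk back while ts is still inside
        wins.append((start, start + size))
        start -= period
    return sorted(wins)

print(fixed_window(3607, 3600))                # hourly window -> (3600, 7200)
print(sliding_windows(12, size=30, period=5))  # 6 windows, each containing t=12
```

Session windows can't be computed this way: the gap-based window boundaries depend on the data itself, as the notes say.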

Pub/Sub

  • Pub/Sub has an at-least-once delivery guarantee => a message may be delivered multiple times
  • Pub/Sub supports push and pull delivery: with push, it notifies subscribers when a new message arrives
  • retains messages for up to 7 days
  • Pull and Push subscriptions
    • Pull can be async or sync
  • streaming pull is the most efficient, and push is the least efficient
  • Order of messages isn't guaranteed
  • topics are named as projects/<project-id>/topics/<topic>
  • Ensure user/service account has correct permission on the topic (publisher/subscriber etc)
  • seek allows replaying messages already acknowledged, two ways to do that
    • snapshot: messages received before the snapshot are restored and messages received after are new
    • timestamp: messages received before a specific time are considered read
      • timestamp can be moved forward to discard unprocessed messages
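Because delivery is at-least-once, subscribers should process idempotently. A minimal sketch of dedup-by-message-ID with a bounded time window (the class and the 10-minute TTL are illustrative, not a Pub/Sub API; this mirrors what Dataflow does for you):

```python
import time

class Deduper:
    """Drop messages whose ID was already seen within the last `ttl` seconds."""
    def __init__(self, ttl=600.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.seen = {}  # message_id -> time last seen

    def accept(self, message_id):
        now = self.clock()
        # evict expired entries so memory stays bounded
        self.seen = {m: t for m, t in self.seen.items() if now - t < self.ttl}
        if message_id in self.seen:
            return False              # duplicate within the window: skip
        self.seen[message_id] = now
        return True                   # first sighting: process it

dedup = Deduper(ttl=600)
print(dedup.accept("msg-1"))  # True  -> process
print(dedup.accept("msg-1"))  # False -> duplicate, skip
```

Note the same limitation as Dataflow's dedup: a duplicate arriving after the TTL expires is treated as new.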

Airflow

  • concepts
    • Operator: A description of a single task. E.g. {Bash,Python,Jdbc,Oracle,MySql,S3FileTransfer}Operator
    • Task: Parameterized Operator
    • Task Instance: A specific run of a Task
    • DAG: Collections of Tasks with their dependencies
    • Executor: runs a task. Concurrent tasks are limited by number of available executors
    • dependencies: task1 >> task2 means task1 is upstream of task2
    • XComs: mechanism to let tasks exchange messages
    • Variable: key-value store to share settings, values etc
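The `task1 >> task2` syntax works through Python operator overloading. A toy sketch of the mechanism (not Airflow's actual implementation; the `Task` class here is a made-up stand-in):

```python
class Task:
    """Minimal stand-in for an Airflow task, to show how >> wires dependencies."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def __rshift__(self, other):
        # self >> other: self is upstream of other
        self.downstream.add(other)
        other.upstream.add(self)
        return other  # returning `other` makes chains like t1 >> t2 >> t3 work

extract, load = Task("extract"), Task("load")
extract >> load
print(load.upstream == {extract})  # True
```

The scheduler then walks these upstream/downstream sets to run tasks in DAG order.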