Big Data
- types of services: BigQuery, Pub/Sub, Data Flow, Data proc, Datalab
- Datalab is Jupyter on cloud with seamless integration with other GCP services
- Data Loss Prevention DLP API allows PII and other information to be recognized such as credit-card, phone numbers
Cloud Composer
- is a managed Apache Airflow service
- In GCS, composer requires at least one environment to be created, which in turn uses Kubernetes instance
- DAGs are stored in storage buckets, changes to which are automatically imported by Composer
Data Lake
- Common sink for raw data, usually stored in Google Storage as blobs
- Common source for DW (BigQuery)
- Data load strategies:
- EL: When source data format is readily consumable (eg BigQuery can read Avro, Parquet, JSON files)
- ELT: When transformation required on source data is relatively cheap or easy (BigQuery can do that in SQL)
- ETL: When transformation required on source data is complex or expensive to perform in DW
Dataproc
- Hadoop cluster without cluster management and dynamic scaling (while jobs are running)
- two ways to select components: Optional components and initialization actions
- use custom images instead of startup scripts for faster startup
- Preemptible VMs (PVM) are up to 80% discounted than regular VMs
- good for cluster auto-scaling or setting up transient clusters
- persistent data must be stored off-cluster (GCS, BigTable, BigQuery) using connectors
- BP: Pick HDFS over GCS when,
- a lot of metadata operations, ie lots of directories and partitions
- Modify data, rename files or HDFS append operations frequently (because GCS is immutable)
- heavy IO, e.g. a lot of partitioned writes
- low latency is required
- BP: read/write initial/final outputs to GCS and intermediate job outputs to HDFS
- attach large SSDs to worker nodes to increase available local HDFS storage
- BP: choosing between large number of nodes with lower # of CPUs or smaller number of nodes with each having high # of CPUs
- there is limit to total number of persistent storage per VM (64TB total or 3TB (375GB * 8) SSD)
- cluster can be automatically deleted if idle (10 min.s or up to 14 days)
- cluster can be auto scaled
- cluster can be provisioned with optional components such as Presto, Jupyter, Anaconda etc
- Use workflow templates to configure, submit, monitor and delete ephemeral clusters
- Use
gcloud to submit jobs rather than native hdfs commands to enable capturing logs via stackdriver
gcloud
gcloud dataproc clusters create sparktodp --region us-central1 --subnet default --zone us-central1-b \
--enable-component-gateway \ # allow access to Spark components UI
--optional-components ANACONDA,JUPYTER \
--master-machine-type n1-standard-4 --num-workers 2 --worker-machine-type n1-standard-4
- get associated GCS bucket name:
gcloud dataproc clusters describe sparktodp --region=us-central1 --format=json | jq -r '.config.configBucket'
- submit a job:
gcloud dataproc jobs submit pyspark --cluster sparktodp --region us-central1 spark_analysis.py -- --bucket=$1
- sample Spark job
Data Fusion
- UI based batch data pipelines creation, optionally using templates (eg, GCS to BQ)
- creates long running instance but is not charged, only the pipeline execution is charged
- instances are created on a GKE cluster as a tenant project
- instance's service account needs to be granted Cloud Data Fusion API Service Agent role
- Componenets:
- Control Center: Applications (consist of 1+ pipelines), Artifacts (non-data eg data dictionary), Dataset
- Pipelines: DAGs built using Developer Studio
- Wrangler: Data Exploration
- Rules Engine allows users to define data transformations and group them in Rulebook
- integrated with data pipelines
- Metadata Aggregator stores business, operational and technical metadata
- allows: better governance using data lineage, produce shareable data dictionaries
- integration with enterprise MDM and the Rules Engine
- integration with Data Wrangler
- provisions ephemeral execution environment (dataproc) and runs pipeline (Spark or MR)
- optionally it can use a pre-provisioned cluster
Dataprep
- Data Wrangling tool to interactively explore and build a Flow or a recipe of transformations to be applied.
- Transformations:
- change data type
- remove unused columns
- add derived column
- map values using CASE statement
- dedup rows
- filter in/out rows based on column values
- apply formula to columns
- pivot/unpivot
- When running a flow, Cloud Dataflow job is created and submitted
DataFlow
- serverless Apache beam
- adaptive transform-by-transform auto-scaling
- v/s reactive auto-scaling based on cluster utilization of dataproc
- concepts
- PTransform are actions or code that act on data, can be input (source), transforms or output (sink)
- PCollections are immutable data that flow from one PTransform to to another
- PCollection stores data as serialized Elements in a byte stream, and are only deserialized when need to be acted upon
- Bounded PCollections are made up of fixed number of Elements
- Unbounded PCollections are streaming and don't have fixed number of Elements
- Pipeline is a DAG collection of PTransforms
- Pipeline Runners are the back-end systems that run actions
- Optimized execution path
- Automatic dynamic load balancing by allocating work from busy nodes to other nodes
- v/s static distribution of work, and the entire cluster must wait for the slower nodes to finish
- Auto-healing: failed nodes' work is routed to other nodes
- Transforms:
ParDo: implements parallel processing
- works on one element at a time, cannot have state
- can accept side inputs, or produce side output
- can be stateful by allowing
@Setup and @Teardown methods, e.g. setup database connection for lookups
GroupByKey: creates a tuple of key and group of values
CoGroupByKey: joins 2 or more PCollections (K, V) and (K, W) using the same key with output (K, [V], [W])
Combine*: reduce or combines a PCollection by applying a function such as sum, min, max etc
- eg.
CombineGlobally(sum) or CombinePerKey(sum) applied to all values or grouped key-value pairs of PCollection respectively
- sub-classes of
CombineFn can implement custom logic, but they must be commutative and associative
CombineByKey is likely to be more efficient than GroupByKey because it does local grouping first which can be parallelized
Flatten: flattens multiple PCollections into one
Partition:
- Can create pipeline templates for end-users
- special
add_value_provider_argument allows end-users to provide argument at run-time
- Cloud Dataflow SQL pipelines can process data using SQL without requiring Python or Java knowledge
- Q: how do you implement exactly once semantics with pubsub? A: Use dataflow pipeline
- if the duplicate message arrives after 10 minutes, it won't be deduped
- Q: how do you handle malformed data since dataflow will try to reprocess forever? A: Use dead-letter side-output in
ParDo
Streaming
- Unlike batch, which has one Global window, streaming has 3 windowing techniques
- Fixed: non-overlapping intervals, e.g. hourly, daily etc
- Sliding: can overlap, e.g. 30 sec duration window every 5 sec period. Useful for computing running aggregates
- Session: data-dependent window, e.g. session events, can't know the window size without looking at data
- Lag Time is the difference between expected arrival time and the actual arrive time of late arriving messages (due to various reasons)
- Watermark is way to determine when a window ends
- Heuristic Watermark techniques try to guess when a window should be closed and emitted
- Processing Late Data is done by triggering.
- data arriving later than watermark are, by default, discarded
- Triggering: decides when aggregation are triggered
- AfterWatermark: this is the default, data arrived within watermark are included
- Python API discards late data, but Java allows late data API
- AfterProcessingTime triggered after processing every interval based on system clock (v/s timestamp associated with the data)
- AfterCount: triggers after certain number of elements have been processed
- Composite trigger allow combining above triggers
- AccumulationMode can be either
- accumulate data within current window is accumulated and emitted during each trigger
- discarding only data since the last trigger is emitted
Pub/Sub
- Pub/Sub has at least once delivery guarantee => message may be delivered multiple times
- Pub/Sub supports push/pull deliver: i.e. it can notify subscribers when new message arrives
- Saves events up to 7 days
- Pull and Push subscriptions
- Pull can be async or sync
- streaming pull is the most efficient, and push is the least efficient
- Order of messages isn't guaranteed
- topics are named as
projects/<project-id>/topics/<topic>
- Ensure user/service account has correct permission on the topic (publisher/subscriber etc)
- seek allows replaying messages already acknowledged, two ways to do that
- snapshot: messages received before the snapshot are restored and messages received after are new
- timestamp: messages received before a specific time are considered read
- timestamp can be moved forward to discard unprocessed messages
Airflow
- concepts
- Operator: A description of a single task. E.g.
{Bash,Python,Jdbc,Oracle,MySql,S3FileTransfer}Operator
- Task: Parameterized Operator
- Task Instance: A specific run of a Task
- DAG: Collections of Tasks with their dependencies
- Executor: runs a task. Concurrent tasks are limited by number of available executors
- dependencies:
task1 >> task2, task1 is upstream to task2
- XComs: mechanism to let tasks exchange messages
- Variable: key-value store to share settings, values etc