Big Data¶
- types of services: BigQuery, Pub/Sub, Dataflow, Dataproc, Datalab
- Datalab is Jupyter in the cloud with seamless integration with other GCP services
- The Data Loss Prevention (DLP) API recognizes PII and other sensitive data, such as credit-card and phone numbers
Cloud Composer¶
- is a managed Apache Airflow service
- In GCP, Composer requires at least one environment to be created, which in turn runs on a GKE cluster
- DAGs are stored in storage buckets, changes to which are automatically imported by Composer
Data Lake¶
- Common sink for raw data, usually stored in Cloud Storage (GCS) as blobs
- Common source for DW (BigQuery)
- Data load strategies:
- EL: When source data format is readily consumable (eg BigQuery can read Avro, Parquet, JSON files)
- ELT: When transformation required on source data is relatively cheap or easy (BigQuery can do that in SQL)
- ETL: When transformation required on source data is complex or expensive to perform in DW
Dataproc¶
- Managed Hadoop/Spark clusters without cluster-management overhead; supports dynamic scaling even while jobs are running
- two ways to select components: Optional components and initialization actions
- use custom images instead of startup scripts for faster startup
- Preemptible VMs (PVMs) are discounted up to 80% compared to regular VMs
- good for cluster auto-scaling or setting up transient clusters
- persistent data must be stored off-cluster (GCS, BigTable, BigQuery) using connectors
- BP: Pick HDFS over GCS when:
- a lot of metadata operations, ie lots of directories and partitions
- frequent data modification, file renames, or HDFS append operations (because GCS objects are immutable)
- heavy IO, e.g. a lot of partitioned writes
- low latency is required
- BP: read/write initial/final outputs to GCS and intermediate job outputs to HDFS
- attach large SSDs to worker nodes to increase available local HDFS storage
- BP: when choosing between a large number of nodes with fewer CPUs each, or a smaller number of nodes with more CPUs each,
- note the limit on total persistent storage per VM (64 TB total, or 3 TB of local SSD = 8 × 375 GB)
- cluster can be automatically deleted when idle (configurable from 10 minutes up to 14 days)
- cluster can be auto scaled
- cluster can be provisioned with optional components such as Presto, Jupyter, Anaconda etc
- Use workflow templates to configure, submit, monitor and delete ephemeral clusters
- Use `gcloud` to submit jobs rather than native HDFS commands, to enable capturing logs via Stackdriver
gcloud¶
- Create a cluster:
- get associated GCS bucket name:
  `gcloud dataproc clusters describe sparktodp --region=us-central1 --format=json | jq -r '.config.configBucket'`
- submit a job:
  `gcloud dataproc jobs submit pyspark --cluster sparktodp --region us-central1 spark_analysis.py -- --bucket=$1`
- sample Spark job
Data Fusion¶
- UI based batch data pipelines creation, optionally using templates (eg, GCS to BQ)
- creates a long-running instance but is not charged; only pipeline execution is charged
- instances are created on a GKE cluster in a tenant project
- instance's service account needs to be granted Cloud Data Fusion API Service Agent role
- Components:
- Control Center: Applications (consist of 1+ pipelines), Artifacts (non-data eg data dictionary), Dataset
- Pipelines: DAGs built using Developer Studio
- Wrangler: Data Exploration
- Rules Engine allows users to define data transformations and group them in Rulebook
- integrated with data pipelines
- Metadata Aggregator stores business, operational and technical metadata
- allows: better governance using data lineage, produce shareable data dictionaries
- integration with enterprise MDM and the Rules Engine
- integration with Data Wrangler
- provisions an ephemeral execution environment (Dataproc) and runs the pipeline (Spark or MR)
- optionally it can use a pre-provisioned cluster
Dataprep¶
- Data Wrangling tool to interactively explore and build a Flow or a recipe of transformations to be applied.
- Transformations:
- change data type
- remove unused columns
- add derived column
- map values using CASE statement
- dedup rows
- filter in/out rows based on column values
- apply formula to columns
- pivot/unpivot
- When a flow is run, a Cloud Dataflow job is created and submitted
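Dataprep recipes are built in the UI, but each step maps to a familiar row/column operation. A plain-Python sketch (not the Dataprep API; the sample rows are made up) of a few of the transformations above:

```python
# Plain-Python sketch of common wrangling steps (not the Dataprep API).
rows = [
    {"name": "alice", "spend": "12.5", "region": "us"},
    {"name": "bob",   "spend": "7.0",  "region": "eu"},
    {"name": "alice", "spend": "12.5", "region": "us"},  # duplicate row
]

# dedup rows (keep first occurrence, preserve order)
seen, deduped = set(), []
for r in rows:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# change data type + add a derived column
for r in deduped:
    r["spend"] = float(r["spend"])                      # string -> float
    r["tier"] = "high" if r["spend"] > 10 else "low"    # CASE-style mapping

# filter rows based on a column value
us_rows = [r for r in deduped if r["region"] == "us"]
```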
Dataflow¶
- serverless Apache Beam
- adaptive transform-by-transform auto-scaling
- v/s reactive auto-scaling based on cluster utilization of dataproc
- concepts
- PTransforms are actions or code that act on data; they can be inputs (sources), transforms, or outputs (sinks)
- PCollections are immutable data that flow from one PTransform to another
- PCollections store data as serialized Elements in a byte stream; Elements are deserialized only when they need to be acted upon
- Bounded PCollections are made up of a fixed number of Elements
- Unbounded PCollections are streaming and don't have a fixed number of Elements
- Pipeline is a DAG collection of PTransforms
- Pipeline Runners are the back-end systems that run actions
- Optimized execution path
- Automatic dynamic load balancing, by reallocating work from busy nodes to other nodes
- v/s static distribution of work, where the entire cluster must wait for the slowest nodes to finish
- Auto-healing: failed nodes' work is routed to other nodes
- Transforms:
  - `ParDo`: implements parallel processing; works on one element at a time and cannot hold state
    - can accept side inputs, or produce side outputs
    - can be made stateful via `@Setup` and `@Teardown` methods, e.g. to set up a database connection for lookups
  - `GroupByKey`: creates a tuple of key and group of values
  - `CoGroupByKey`: joins 2 or more PCollections `(K, V)` and `(K, W)` using the same key, with output `(K, [V], [W])`
  - `Combine*`: reduces or combines a PCollection by applying a function such as `sum`, `min`, `max` etc
    - eg. `CombineGlobally(sum)` or `CombinePerKey(sum)`, applied to all values or to grouped key-value pairs of a PCollection respectively
    - subclasses of `CombineFn` can implement custom logic, but they must be commutative and associative
    - `CombinePerKey` is likely to be more efficient than `GroupByKey` because it does local grouping first, which can be parallelized
  - `Flatten`: flattens multiple PCollections into one
  - `Partition`: splits a PCollection into a fixed number of smaller PCollections
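The efficiency claim about `CombinePerKey` can be sketched in plain Python (not the Beam API): because the combine function is commutative and associative, each worker pre-combines its shard locally, so only one partial value per key is shuffled instead of every element.

```python
# Two "workers" each hold a shard of a keyed dataset (PCollection-like).
shard_a = [("k1", 1), ("k2", 5), ("k1", 3)]
shard_b = [("k1", 2), ("k2", 4)]

def local_combine(shard, fn):
    """Pre-combine values per key on one worker, before the shuffle."""
    acc = {}
    for k, v in shard:
        acc[k] = fn([acc[k], v]) if k in acc else v
    return acc

# Each worker ships only one partial value per key...
partial_a = local_combine(shard_a, sum)   # {'k1': 4, 'k2': 5}
partial_b = local_combine(shard_b, sum)   # {'k1': 2, 'k2': 4}

# ...and the final merge combines the partials, which is valid only
# because sum is commutative and associative.
final = {}
for partials in (partial_a, partial_b):
    for k, v in partials.items():
        final[k] = final.get(k, 0) + v
# final == {'k1': 6, 'k2': 9}
```

With `GroupByKey` followed by a reduction, every `(key, value)` pair would cross the shuffle instead of one partial per key per worker.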
- Can create pipeline templates for end-users
  - the special `add_value_provider_argument` allows end-users to provide arguments at run-time
- Cloud Dataflow SQL pipelines can process data using SQL without requiring Python or Java knowledge
- Q: how do you implement exactly-once semantics with Pub/Sub? A: Use a Dataflow pipeline, which dedups on Pub/Sub message IDs
  - if a duplicate message arrives more than 10 minutes later, it won't be deduped
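The dedup behavior can be sketched in plain Python (not the Dataflow implementation; `DEDUP_WINDOW` is an illustrative constant): remember message IDs only for a bounded window, so a duplicate arriving after the window slips through.

```python
DEDUP_WINDOW = 600.0  # seconds; Dataflow remembers message IDs ~10 minutes

seen = {}  # message_id -> timestamp of first arrival

def accept(message_id, now):
    """Return True if the message should be processed, False if deduped."""
    # Forget IDs that fell out of the dedup window.
    for mid in [m for m, t in seen.items() if now - t > DEDUP_WINDOW]:
        del seen[mid]
    if message_id in seen:
        return False              # duplicate within the window: dropped
    seen[message_id] = now
    return True                   # first arrival (or window expired)
```

Calling `accept` with the same ID twice within 10 minutes drops the second call; a duplicate arriving after the window is processed again, which is exactly the caveat above.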
- Q: how do you handle malformed data, since Dataflow will try to reprocess it forever? A: Use a dead-letter side output in `ParDo`
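The dead-letter pattern in plain Python (the Beam version uses a `ParDo` with a tagged side output): parse each element, and route failures to a separate output instead of raising, so the pipeline never retries them forever.

```python
import json

def process_with_dead_letter(elements):
    """Route unparseable records to a dead-letter output instead of failing."""
    good, dead_letter = [], []
    for raw in elements:
        try:
            good.append(json.loads(raw))
        except json.JSONDecodeError:
            dead_letter.append(raw)  # e.g. write these to GCS/BQ to inspect
    return good, dead_letter

good, bad = process_with_dead_letter(['{"id": 1}', 'not-json', '{"id": 2}'])
# good == [{'id': 1}, {'id': 2}]; bad == ['not-json']
```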
Streaming¶
- Unlike batch, which has one Global window, streaming has 3 windowing techniques
- Fixed: non-overlapping intervals, e.g. hourly, daily etc
- Sliding: can overlap, e.g. 30 sec duration window every 5 sec period. Useful for computing running aggregates
- Session: data-dependent window, e.g. session events, can't know the window size without looking at data
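Fixed and sliding window assignment can be sketched directly in plain Python (not the Beam windowing API): a fixed window maps each timestamp to exactly one interval, while a sliding window of duration 30s every 5s maps each timestamp to several overlapping intervals.

```python
def fixed_window(ts, size):
    """Start/end of the single non-overlapping window containing ts."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts, size, period):
    """All overlapping windows of length `size`, starting every `period`,
    that contain ts."""
    wins = []
    start = (ts // period) * period
    while start + size > ts:
        wins.append((start, start + size))
        start -= period
    return wins

fixed_window(3605, 3600)                # hourly window: (3600, 7200)
sliding_windows(47, size=30, period=5)  # six windows, (45, 75) down to (20, 50)
```

Each element landing in `size / period` sliding windows is what makes them useful for running aggregates.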
- Lag Time is the difference between the expected arrival time and the actual arrival time of late-arriving messages (due to various reasons)
- Watermark is a way to determine when a window ends
- Heuristic Watermark techniques try to guess when a window should be closed and emitted
- Processing Late Data is done by triggering
  - data arriving later than the watermark is, by default, discarded
- Triggering: decides when aggregations are emitted
  - AfterWatermark: this is the default; data that arrived within the watermark is included
- Python API discards late data, but Java allows late data API
- AfterProcessingTime: triggers after each processing-time interval based on the system clock (v/s the timestamp associated with the data)
- AfterCount: triggers after certain number of elements have been processed
- Composite triggers allow combining the above triggers
- AccumulationMode can be either:
  - accumulating: data within the current window is accumulated and re-emitted during each trigger
  - discarding: only data since the last trigger is emitted
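The two accumulation modes in miniature: given three trigger firings over one window, accumulating re-emits everything seen so far, while discarding emits only the delta. A plain-Python sketch (the pane contents are made up):

```python
# Elements that arrived between successive trigger firings of one window.
panes = [[1, 2], [3], [4, 5]]

def emit(panes, mode):
    out, window_so_far = [], []
    for pane in panes:
        window_so_far.extend(pane)
        if mode == "accumulating":
            out.append(list(window_so_far))  # everything seen so far
        else:  # "discarding"
            out.append(list(pane))           # only data since last trigger
    return out

emit(panes, "accumulating")  # [[1, 2], [1, 2, 3], [1, 2, 3, 4, 5]]
emit(panes, "discarding")    # [[1, 2], [3], [4, 5]]
```

Downstream consumers that overwrite results want accumulating panes; consumers that add deltas want discarding panes.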
Pub/Sub¶
- Pub/Sub has at least once delivery guarantee => message may be delivered multiple times
- Pub/Sub supports push/pull delivery: with push it can notify subscribers when a new message arrives
- Retains messages for up to 7 days
- Pull and Push subscriptions
- Pull can be async or sync
- streaming pull is the most efficient, and push is the least efficient
- Order of messages isn't guaranteed
- topics are named as `projects/<project-id>/topics/<topic>`
- Ensure the user/service account has the correct permissions on the topic (publisher/subscriber etc)
- seek allows replaying messages already acknowledged, two ways to do that
- snapshot: messages received before the snapshot are restored and messages received after are new
- timestamp: messages received before a specific time are considered read
- timestamp can be moved forward to discard unprocessed messages
Airflow¶
- concepts
- Operator: a description of a single task, e.g. `{Bash,Python,Jdbc,Oracle,MySql,S3FileTransfer}Operator`
- Task: parameterized Operator
- Task Instance: A specific run of a Task
- DAG: Collections of Tasks with their dependencies
- Executor: runs a task; concurrent tasks are limited by the number of available executors
- dependencies: `task1 >> task2` means task1 is upstream of task2
- XComs: mechanism to let tasks exchange messages
- Variable: key-value store to share settings, values etc
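The `>>` dependency syntax works because Airflow operators overload Python's `__rshift__`. A minimal standalone sketch of that pattern (toy classes, not the Airflow API):

```python
class Task:
    """Toy stand-in for an Airflow operator, showing how `>>` wires a DAG."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        self.downstream.append(other)  # self runs before `other`
        return other                   # returning `other` lets a >> b >> c chain

extract = Task("extract")
transform = Task("transform")
load = Task("load")

extract >> transform >> load  # extract -> transform -> load

[t.task_id for t in extract.downstream]  # ['transform']
```

Real Airflow does the same wiring via `set_downstream` inside `BaseOperator.__rshift__`.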