Architecture
Databricks is divided into two classic planes, the control plane and the data plane:
| Control Plane | Data Plane |
| --- | --- |
| Managed | User control |
| Databricks cloud account | User cloud account |
| Web App, Config, Notebooks, Repos, DBSQL | Data, DBFS Root (Cloud storage) |
| Cluster manager | Cluster |
Serverless Data Plane
- Databricks managed cluster
- available only to users of DBSQL
- Pros: instant (no cluster provisioning), minimal configuration, automatic software updates, reduced idle time, no over-provisioning
- Elastic: scales up and down automatically
Data Architecture
| Medallion | Feature | EDW Term |
| --- | --- | --- |
| Bronze | Raw Files | Landing |
| Bronze | Schema Validation | Raw |
| Silver | Cleansed/Validated | Core |
| Silver | Conformed | Enriched |
| Gold | Analytical Models | Curated |
| Gold | Reporting Models | Semantic |
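
The layer mapping in the table can be illustrated with a minimal plain-Python sketch. It is not Databricks or Delta Lake code: the records, field names, and the helpers `to_silver` and `to_gold` are hypothetical and only show how data is progressively refined from Bronze to Gold.

```python
from collections import defaultdict

# Hypothetical raw records, kept as-is in the Bronze layer (everything is a string).
bronze = [
    {"order_id": "1", "region": "EU", "amount": "10.50"},
    {"order_id": "2", "region": "eu", "amount": "4.50"},
    {"order_id": "3", "region": "US", "amount": "not-a-number"},
    {"order_id": "4", "region": "US", "amount": "20.00"},
]


def to_silver(rows):
    """Cleanse and validate: cast types, conform region codes, skip bad rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # invalid amount: left out of Silver, still present in Bronze
        clean.append(
            {
                "order_id": int(row["order_id"]),
                "region": row["region"].upper(),
                "amount": amount,
            }
        )
    return clean


def to_gold(rows):
    """Business-level aggregate: total amount per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In a real workspace each layer would typically be a Delta table rather than an in-memory list; the sketch only shows the direction of refinement.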
- Medallion architecture
- Bronze: Raw ingestion, historical, source schema validated
- Silver: cleansed, validated, conformed
- Gold: business-level aggregates, analytical model, reporting model
- DW layers to
- Delta Live Tables (DLT): declarative, full and incremental refresh, dependency management, checkpoint restart
- Databricks Workflows: orchestrates pipelines written using DLT, dbt, or other tools
- Streaming
- DLT, Spark Structured Streaming
- ML: built-in
- Frameworks: TensorFlow, Spark, Keras, XGBoost
- Distributed training: Spark, TensorFlow
- AutoML and hyperparameter tuning
- GPU acceleration
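
The dependency management noted for Delta Live Tables in the list above can be pictured as ordering tables so that each one is refreshed only after the tables it reads from. The sketch below is a conceptual illustration using Python's standard `graphlib` module, not the DLT API; the table names and the helper `refresh_order` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each table maps to the set of tables it reads from.
dependencies = {
    "orders_bronze": set(),
    "customers_bronze": set(),
    "orders_silver": {"orders_bronze"},
    "customers_silver": {"customers_bronze"},
    "sales_gold": {"orders_silver", "customers_silver"},
}


def refresh_order(deps):
    """Return one valid refresh order in which upstream tables come first."""
    return list(TopologicalSorter(deps).static_order())


order = refresh_order(dependencies)
print(order)
```

Several orders can be valid for the same graph; the only guarantee is that an upstream table always appears before the tables that depend on it.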
Data Governance
- Data discovery
- Access Control
- Data lineage
- Cataloging
- Auditing
- Quality
Databricks Runtime (DBR)
- Photon: proprietary execution engine written in C++; it is not available in open-source Apache Spark
Workspace
dbutils
- A collection of utilities that can be invoked from a notebook or the CLI
- categories:
  - credentials
  - data: understand datasets; common method: summarize
  - fs: same as the %fs magic command in a notebook; common methods: ls, head, mkdirs, cp, mount, etc.
  - jobs
  - library (session related)
  - meta (compiler hooks)
  - notebook (control notebook flow); run (same as the %run magic command) runs another notebook
  - secrets (manage secrets in notebooks)
  - widgets (create input widgets in a notebook and read their bound values)
  - preview (utilities that are in preview)
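
As a rough sketch of the fs category above: inside a notebook `dbutils` is predefined and `dbutils.fs.ls(path)` returns entries with `path`, `name`, and `size` attributes. Outside Databricks there is no `dbutils`, so the example passes it in as a parameter and uses a small stand-in object; the helper `csv_paths`, the stand-in, and the directory `/mnt/raw` are hypothetical.

```python
from types import SimpleNamespace


def csv_paths(dbutils, directory):
    """Return the paths of CSV files directly under `directory`."""
    return [f.path for f in dbutils.fs.ls(directory) if f.name.endswith(".csv")]


# Stand-in for running outside a notebook. It mimics only the shape of what
# dbutils.fs.ls returns; it is NOT the real utility and the files are made up.
def _fake_ls(directory):
    names = ["a.csv", "b.json", "c.csv"]
    return [SimpleNamespace(path=f"{directory}/{n}", name=n, size=0) for n in names]


fake_dbutils = SimpleNamespace(fs=SimpleNamespace(ls=_fake_ls))

print(csv_paths(fake_dbutils, "/mnt/raw"))
```

In a notebook the same helper would be called as `csv_paths(dbutils, "/mnt/raw")`, with the real `dbutils` object in place of the stand-in.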
Cluster options
- Photon: an optimized C++-based execution engine, much faster than the standard JVM-based Spark engine
- workers: type (CSP node type), min and max workers
- driver: type, defaults to same as worker node
- enable auto-scaling: the cluster scales between min and max workers based on load
- terminate after: n minutes of inactivity
- Access mode:
- Single user:
- Allows credential passthrough, that is, logged-in user's credentials are used to access object storage
- Multi user:
- Databricks Runtime: two categories, Standard and ML, with various versions available
- Some ML runtimes support GPU
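
The options listed above could be collected into a cluster specification along these lines. This is a sketch under assumptions: the field names follow the Databricks Clusters REST API as I understand it and should be checked against the current API reference, the helper `build_cluster_spec` is hypothetical, and the runtime version and node type are placeholder values.

```python
def build_cluster_spec(name, runtime_version, node_type, min_workers, max_workers,
                       idle_minutes, photon=True, driver_type=None):
    """Assemble a cluster spec dict from the options discussed above."""
    if not 0 <= min_workers <= max_workers:
        raise ValueError("need 0 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": runtime_version,              # Databricks Runtime version
        "runtime_engine": "PHOTON" if photon else "STANDARD",
        "node_type_id": node_type,                     # worker type (CSP node type)
        "driver_node_type_id": driver_type or node_type,  # defaults to worker type
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": idle_minutes,       # terminate after inactivity
        "data_security_mode": "SINGLE_USER",           # single-user access mode
    }


# Placeholder values, not recommendations.
spec = build_cluster_spec(
    name="notes-demo",
    runtime_version="13.3.x-scala2.12",
    node_type="i3.xlarge",
    min_workers=2,
    max_workers=8,
    idle_minutes=30,
)
print(spec)
```

The helper only builds and validates the dictionary; sending it to a workspace would be a separate step through the REST API, the CLI, or an SDK.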