
Architecture

Divided into two classic planes

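| Control Plane                            | Data Plane                     |
|------------------------------------------|--------------------------------|
| Managed                                  | User control                   |
| Databricks cloud account                 | User cloud account             |
| Web App, Config, Notebooks, Repos, DBSQL | Data, DBFS Root (Cloud storage) |
| Cluster manager                          | Cluster                        |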

Serverless Data Plane

  • Databricks managed cluster
  • available only to users of DBSQL
  • Pros: instant (no cluster provisioning), minimal configuration, automatic software updates, reduced idle time, no over-provisioning
  • Elastic: scales up and down automatically

Data Architecture

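| Medallion | Feature            | EDW Term |
|-----------|--------------------|----------|
| Bronze    | Raw Files          | Landing  |
| Bronze    | Schema Validation  | Raw      |
| Silver    | Cleansed/Validated | Core     |
| Silver    | Conformed          | Enriched |
| Gold      | Analytical Models  | Curated  |
| Gold      | Reporting Models   | Semantic |
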
  • Medallion architecture
    1. Bronze: Raw ingestion, historical, source schema validated
    2. Silver: cleansed, validated, conformed
    3. Gold: business-level aggregates, analytical model, reporting model
  • DW layers to
  • Delta Live Tables (DLT): declarative, full and incremental refresh, dependency management, checkpoint restart
  • Databricks Workflows: pipeline, written using DLT, dbt or other tools
  • Streaming
    • DLT, Spark Structured Streaming
  • ML: built-in
    • Frameworks: TensorFlow, Spark, Keras, XGBoost
    • Distributed training: Spark, TensorFlow
    • AutoML and hyperparameter tuning
    • GPU acceleration
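
The Bronze, Silver and Gold layers listed above can be illustrated with a minimal plain-Python sketch. It is not Databricks or DLT code: the sample feed, the column names and the function names are all hypothetical, and on the platform each layer would typically be a Delta table rather than an in-memory list.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw feed; in practice Bronze would hold files landed from a source system.
RAW_CSV = """order_id,region,amount
1,emea,10.50
2,EMEA,4.50
3,apac,not_a_number
4,apac,20.00
"""


def to_bronze(raw_text):
    """Bronze: keep every source row as received (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def to_silver(bronze_rows):
    """Silver: cast types, drop rows that fail validation, conform values."""
    silver = []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # invalid rows remain available in Bronze only
        silver.append(
            {
                "order_id": int(row["order_id"]),
                "region": row["region"].upper(),  # conform region codes
                "amount": amount,
            }
        )
    return silver


def to_gold(silver_rows):
    """Gold: business-level aggregate (total amount per region)."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


bronze = to_bronze(RAW_CSV)
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Keeping the unparsed rows in Bronze means the Silver rules can be changed later and replayed without going back to the source system.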

Data Governance

  • Data discovery
  • Access Control
  • Data lineage
  • Cataloging
  • Auditing
  • Quality

Databricks Runtime DBR

  • Photon: proprietary execution engine written in C++; it is not part of open-source Apache Spark

Workspace

dbutils

  • A collection of utilities that can be invoked from notebook, cli
  • categories:
    • credentials
    • data: utilities for understanding data; common method: summarize
    • fs: same as the %fs magic command in a notebook; common methods: ls, head, mkdirs, cp, mount, etc.
    • jobs
    • library (session related)
    • meta (compiler hooks)
    • notebook (control notebook flow); run: runs another notebook (similar to the %run magic command)
    • secrets (manage secrets in notebooks)
    • widgets (create input widgets in a notebook and read their bound values)
    • preview (utilities that are still in preview)
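
A few of the utilities above, sketched as a small helper. In a Databricks notebook `dbutils` is already defined; here it is passed in as a parameter so the function can also be exercised with a stand-in object elsewhere. The function name, the path and the widget name `env` are hypothetical.

```python
def explore_landing_zone(dbutils, path="/tmp/demo"):
    """Create a directory, list it, and read a widget value via dbutils."""
    # fs: file system helpers (the same operations that %fs exposes)
    dbutils.fs.mkdirs(path)
    file_names = [info.name for info in dbutils.fs.ls(path)]

    # widgets: define a text input with a default, then read its bound value
    dbutils.widgets.text("env", "dev")
    env = dbutils.widgets.get("env")

    return {"env": env, "files": file_names}
```

In a notebook this would be called as `explore_landing_zone(dbutils)`.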

Workspace options

  • Photon: an optimized C++-based execution engine, much faster than the standard JVM-based Spark engine
  • workers: type (CSP node type), min and max workers
  • driver: type, defaults to same as worker node
  • enable auto-scaling: scales between min and max workers based on load
  • terminate after: n minutes of inactivity
  • Access mode:
    • Single user:
      • Allows credential passthrough, that is, logged-in user's credentials are used to access object storage
    • Multi user:
  • Databricks Runtime: two categories, Standard and ML, with various versions available
    • Some ML runtimes support GPU
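
The options above map roughly onto a cluster specification. The sketch below builds one as a Python dict. The field names follow the Databricks Clusters REST API as I understand it and should be checked against the current API reference; the function name and every value in the usage lines (runtime version string, node type, user name) are placeholders.

```python
import json


def build_cluster_spec(name, runtime_version, node_type, min_workers,
                       max_workers, idle_minutes, user_name):
    """Assemble a cluster spec dict; field names are assumed from the Clusters API."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": name,
        "spark_version": runtime_version,         # Databricks Runtime version
        "runtime_engine": "PHOTON",               # enable Photon
        "node_type_id": node_type,                # worker type (CSP node type)
        "driver_node_type_id": node_type,         # driver defaults to the worker type
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": idle_minutes,  # terminate after n idle minutes
        "data_security_mode": "SINGLE_USER",      # access mode: single user
        "single_user_name": user_name,
    }


# Placeholder values for illustration only.
spec = build_cluster_spec(
    name="demo-cluster",
    runtime_version="13.3.x-scala2.12",
    node_type="i3.xlarge",
    min_workers=1,
    max_workers=4,
    idle_minutes=30,
    user_name="someone@example.com",
)
print(json.dumps(spec, indent=2))
```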