Skip to content

Data Warehouse

  • If data-marts used, DW doesn't necessarily need a dimensional model, but it feeds a dimensional model
  • is immutable and history preserving
  • comparing Kimball v/s Inmon methodology
  • Slowly Changing Dimensions
  • Type 1: Overwrite, Type 2: add new row, Type 3: add new attribute, Type 4: add history table, Type 6: Combination of types 1+2+3
  • Type 2/4/6 advocate use of new surrogate key to be generated for each revision
  • Type 3 only captures one previous history (or how many attributes are added)
  • lambda architecture: uses both batch and stream architecture to provide two datasets with different qualities
  • batch layer: has high latency, high accuracy and comprehensive. e.g. hadoop
  • speed layer: has low latency, low accuracy. e.g. stream based processing such as Apache storm, spark, etc
  • serving layer: stores data from both, batch and speed layers to satisfy user queries. e.g. druid, cassandra, HBase
  • Data Warehouse Types
DW Data type Use
Traditional normalized,dimensional analytics and BI
Logical federated, virtual layer easy consolidation
Real-time streaming data real-time analytics
Context-independent graph, non-relational data science, data exploration, ad-hoc

Kimball methodology

  • DW and data-marts designs are driven by business processes and questions (top-down approach)
  • Federated Dimensional Models
  • EDW is logical collection of decentralized, independent data-marts
  • Integration via conformed dimensions
  • Cons
  • highly dependent on business rules => business rule changes require dimensions need to be reconfigured and reloaded
  • high load failures, slow load times for large data due to,
    • complexity in loading conformed dimensions
    • load dependency
    • dimensional overload
  • real-time streaming is cumbersome due to common technique that involves
    • adding a dummy record for dimension
    • then perform CDC to add another row (to stick to RO EDW)
  • If business needs cubes, they are often placed downstream, which takes a lot of resources and need constant syncing
  • Emphasis on dimensions, separating and protecting PII is often an afterthought and bolted-on approach
  • Difficult for data scientists since they need raw data and are often given access to landing zone data, which
    • does not contain integrated business keys
    • may not have deltas identified
    • no metadata or governance

Inmon methodology

  • Emphasis on creating EDW to represent every entity (bottom-up)
  • Normalized (3NF) model
  • EDW consists of centralized normalized tables (off-limits to users)
  • Users access dependent data-marts
  • Cons
  • PK-SK => data loading is highly order dependent
  • Some implementations of PK change require copying child rows, since start_date is part of the PK

Data Mart

  • Subject area specific
  • Usually aggregated
  • User's entry point to access data
  • Can be a dependent or an independent data-mart

Access Layer

  • Base Tables -> Core (1:1) views -> Access Control -> Semantic -> Users
  • Access Control: Security, Privacy and Bypass
  • A Relational Model focuses on capturing business rules
  • A Dimensional Model focuses on evaluation, that is, monitoring business through metrics
  • The relationships in a dimensional model represent navigational paths v/s business rules in relational model
  • Has measures, such as, amounts, counts, duration, that are mathematical
  • Has meters, that is fact tables, that servers as buckets for measures
    • Grain is meter’s lowest level of detail
  • Has dimensions, are various ways to aggregate/filter measures
  • Supports Navigation, such as drill up/down and across (cross meter measurements)
  • Normal red flags: normalized structures, fuzzy grain, subtypes, too abstract
  • Most Semantic Models are Dimensional Models, but can be
  • Relational or Dimensional
  • Conceptual, Logical or Physical