
Data Warehouse

  • If data-marts are used, the DW doesn't necessarily need a dimensional model, but it feeds a dimensional model
  • is immutable and history preserving
  • Kimball v/s Inmon methodologies are compared below
  • Slowly Changing Dimensions
    • Type 1: overwrite; Type 2: add a new row; Type 3: add a new attribute (column); Type 4: add a history table; Type 6: combination of types 1+2+3
    • Types 2/4/6 call for a new surrogate key to be generated for each revision
    • Type 3 captures only one previous value per tracked attribute (or as many as the number of attribute columns added)
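The Type 2 pattern above can be sketched in Python. This is a toy illustration; the table and column names (surrogate key `sk`, `start_date`/`end_date`, `is_current` flag) are assumptions, not a prescribed schema:

```python
from datetime import date

# Minimal SCD Type 2 sketch: expire the current row and insert a new
# version under a fresh surrogate key. Names/data are illustrative.
dim_customer = [
    {"sk": 1, "customer_id": "C100", "city": "Austin",
     "start_date": date(2020, 1, 1), "end_date": None, "is_current": True},
]

def scd2_update(dim, customer_id, new_city, effective):
    next_sk = max(row["sk"] for row in dim) + 1   # new surrogate key per revision
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["end_date"] = effective            # close out the old version
            row["is_current"] = False
    dim.append({"sk": next_sk, "customer_id": customer_id, "city": new_city,
                "start_date": effective, "end_date": None, "is_current": True})

scd2_update(dim_customer, "C100", "Denver", date(2024, 6, 1))
# dim_customer now holds two rows: the expired Austin version and the
# current Denver version under a new surrogate key.
```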
  • lambda architecture: uses both batch and stream architecture to provide two datasets with different qualities
    • batch layer: high latency, high accuracy, comprehensive; e.g. Hadoop
    • speed layer: low latency, lower accuracy; e.g. stream processing such as Apache Storm, Spark Streaming, etc.
    • serving layer: stores data from both the batch and speed layers to satisfy user queries; e.g. Druid, Cassandra, HBase
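The three layers above can be sketched as a toy serving-layer merge in Python. Data and names are illustrative; a real serving layer (e.g. Druid) would merge at query time over indexed segments:

```python
# Toy lambda-architecture sketch: the batch view is accurate but stale,
# the speed view covers only events since the last batch run, and the
# serving layer merges the two per key. All data is illustrative.
batch_view = {"page_a": 1000, "page_b": 500}   # e.g. recomputed nightly (Hadoop)
speed_view = {"page_a": 7, "page_c": 3}        # e.g. maintained by Storm/Spark

def query(page):
    # Serving layer: combine the comprehensive batch result
    # with the low-latency incremental result.
    return batch_view.get(page, 0) + speed_view.get(page, 0)

query("page_a")  # → 1007 (1000 from batch + 7 from speed)
```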
  • Data Warehouse Types
    | DW                  | Data type                | Use                                    |
    |---------------------|--------------------------|----------------------------------------|
    | Traditional         | normalized, dimensional  | analytics and BI                       |
    | Logical             | federated, virtual layer | easy consolidation                     |
    | Real-time           | streaming data           | real-time analytics                    |
    | Context-independent | graph, non-relational    | data science, data exploration, ad-hoc |

Kimball methodology

  • DW and data-mart designs are driven by business processes and questions (bottom-up approach: data-marts first, integrated into the EDW)
  • Federated Dimensional Models
  • EDW is a logical collection of decentralized, independent data-marts
  • Integration via conformed dimensions
  • Cons
    • highly dependent on business rules => a business-rule change requires dimensions to be reconfigured and reloaded
    • high load-failure rates and slow load times for large data volumes, due to:
      • complexity in loading conformed dimensions
      • load dependency
      • dimensional overload
    • real-time streaming is cumbersome, due to a common technique that involves
      • adding a dummy (placeholder) record to the dimension
      • then performing CDC to add another row (to keep the EDW read-only)
    • If the business needs cubes, they are often placed downstream, which consumes significant resources and requires constant syncing
    • Emphasis is on dimensions; separating and protecting PII is often an afterthought and a bolted-on approach
    • Difficult for data scientists, since they need raw data and are often given access to landing-zone data, which
      • does not contain integrated business keys
      • may not have deltas identified
      • has no metadata or governance
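The dummy-record technique mentioned above (a placeholder dimension member inserted so an early-arriving fact can resolve its foreign key, with CDC filling in the real attributes later) can be sketched as follows. All names here are illustrative:

```python
import itertools

# Sketch of the "dummy dimension record" workaround: when a fact arrives
# before its dimension row, insert a placeholder member so the fact's
# foreign key resolves; a later CDC pass adds the real attributes as a
# new row, keeping the EDW append-only. Names are illustrative.
dim_product = {}            # business key -> dimension row
_sk = itertools.count(1)    # surrogate-key generator

def get_or_create_sk(business_key):
    if business_key not in dim_product:
        # Placeholder ("dummy") member; real attributes arrive later via CDC.
        dim_product[business_key] = {"sk": next(_sk), "name": "UNKNOWN"}
    return dim_product[business_key]["sk"]

# A fact lands before product P42 exists in the dimension:
fact_sales = [{"product_sk": get_or_create_sk("P42"), "amount": 9.99}]
```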

Inmon methodology

  • Emphasis on creating an EDW to represent every entity (top-down: centralized EDW first, then dependent data-marts)
  • Normalized (3NF) model
  • EDW consists of centralized normalized tables (off-limits to users)
  • Users access dependent data-marts
  • Cons
    • PK-SK => data loading is highly order dependent
    • Some implementations require child rows to be copied when a parent's PK changes, since start_date is part of the PK

Data Mart

  • Subject area specific
  • Usually aggregated
  • User's entry point to access data
  • Can be a dependent or an independent data-mart

Access Layer

  • Base Tables -> Core (1:1) views -> Access Control -> Semantic -> Users
    • Access Control: Security, Privacy and Bypass
  • A Relational Model focuses on capturing business rules
  • A Dimensional Model focuses on evaluation, that is, monitoring business through metrics
    • The relationships in a dimensional model represent navigational paths, v/s the business rules captured in a relational model
    • Has measures, such as amounts, counts, and durations, that are mathematical
    • Has meters, that is, fact tables, which serve as buckets for measures
      • Grain is the meter’s lowest level of detail
    • Has dimensions, which are the various ways to aggregate/filter measures
    • Supports Navigation, such as drill up/down and across (cross meter measurements)
    • Normal red flags: normalized structures, fuzzy grain, subtypes, too abstract
  • Most Semantic Models are Dimensional Models, but can be
    • Relational or Dimensional
    • Conceptual, Logical or Physical
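The measure/meter/dimension vocabulary above can be illustrated with a toy fact table and a rollup. Table, column names, and data are all assumptions for illustration only:

```python
from collections import defaultdict

# Toy fact table ("meter") at grain (date, store, product), with one
# additive measure, "amount". Dropping a dimension from the rollup is
# drilling up to a coarser grain. Data is illustrative.
fact_sales = [
    {"date": "2024-06-01", "store": "S1", "product": "P1", "amount": 10.0},
    {"date": "2024-06-01", "store": "S1", "product": "P2", "amount": 5.0},
    {"date": "2024-06-01", "store": "S2", "product": "P1", "amount": 8.0},
]

def rollup(facts, dims):
    # Aggregate the measure over the chosen dimension attributes.
    out = defaultdict(float)
    for row in facts:
        out[tuple(row[d] for d in dims)] += row["amount"]
    return dict(out)

rollup(fact_sales, ["store"])
# → {('S1',): 15.0, ('S2',): 8.0}   (drilled up from product to store level)
```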