Data Warehouse¶
- If data-marts are used, the DW doesn't necessarily need a dimensional model, but it feeds one
- is immutable and history-preserving
- comparing Kimball v/s Inmon methodologies
- Slowly Changing Dimensions
- Type 1: Overwrite, Type 2: add new row, Type 3: add new attribute, Type 4: add history table, Type 6: Combination of types 1+2+3
- Types 2/4/6 advocate generating a new surrogate key for each revision
- Type 3 only captures one previous value (or as many as the number of history attributes added)
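The Type 2 mechanics above (expire the current row, insert a new row with a new surrogate key) can be sketched in Python; the row layout, column names (`sk`, `is_current`, etc.), and data are illustrative assumptions, not any specific warehouse's schema:

```python
from datetime import date

# Illustrative SCD Type 2 dimension: one row per revision, each with its
# own surrogate key (`sk`); `is_current` marks the active revision.
dim_customer = [
    {"sk": 1, "customer_id": "C1", "city": "Austin",
     "start_date": date(2020, 1, 1), "end_date": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_city, change_date, next_sk):
    """Type 2 change: expire the current row, then add a new row with a new surrogate key."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return dim  # no change detected, nothing to do
            row["end_date"] = change_date   # close out the old revision
            row["is_current"] = False
    dim.append({"sk": next_sk, "customer_id": customer_id, "city": new_city,
                "start_date": change_date, "end_date": None, "is_current": True})
    return dim

# Customer C1 moves: history is preserved as two rows, two surrogate keys
apply_scd2(dim_customer, "C1", "Denver", date(2021, 6, 1), next_sk=2)
```

A Type 1 change would instead overwrite `city` in place, losing history.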
- lambda architecture: uses both batch and stream architecture to provide two datasets with different qualities
- batch layer: high latency, high accuracy, comprehensive. e.g. Hadoop
- speed layer: low latency, lower accuracy. e.g. stream processing such as Apache Storm, Spark Streaming, etc.
- serving layer: stores data from both the batch and speed layers to satisfy user queries. e.g. Druid, Cassandra, HBase
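A minimal sketch of how a serving layer might answer queries by merging the two views, assuming precomputed batch counts plus incremental speed-layer counts since the last batch run (all names and data are illustrative):

```python
# Illustrative lambda-architecture serving layer: the batch view is
# authoritative up to its last run; the speed view covers the gap since then.
batch_view = {"page_a": 100, "page_b": 40}   # accurate, high latency
speed_view = {"page_a": 3, "page_c": 1}      # approximate, low latency

def serve(key):
    """Combine comprehensive batch counts with recent speed-layer counts."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

serve("page_a")  # batch count plus recent increments
```

When the next batch run completes, its output absorbs the speed-layer data and the speed view is reset.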
- Data Warehouse Types
| DW | Data type | Use |
|---|---|---|
| Traditional | normalized, dimensional | analytics and BI |
| Logical | federated, virtual layer | easy consolidation |
| Real-time | streaming data | real-time analytics |
| Context-independent | graph, non-relational | data science, data exploration, ad-hoc |
- Architecture Patterns
- Data Architecture: Logical Data Modeling
Kimball methodology¶
- DW and data-mart designs are driven by business processes and questions (bottom-up approach)
- Federated Dimensional Models
- EDW is logical collection of decentralized, independent data-marts
- Integration via conformed dimensions
- Cons
- highly dependent on business rules => business rule changes require dimensions to be reconfigured and reloaded
- frequent load failures and slow load times for large data due to:
- complexity in loading conformed dimensions
- load dependency
- dimensional overload
- real-time streaming is cumbersome due to a common technique that involves
- adding a dummy record for the dimension
- then performing CDC to add another row (to keep the EDW read-only)
- If the business needs cubes, they are often placed downstream, which consumes significant resources and requires constant syncing
- Emphasis is on dimensions; separating and protecting PII is often an afterthought, bolted on
- Difficult for data scientists, since they need raw data but are often given access to landing-zone data, which
- does not contain integrated business keys
- may not have deltas identified
- has no metadata or governance
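The dummy-record workaround noted under the streaming con above can be sketched as follows; `dim_product`, the `dummy` flag, and the CDC backfill function are illustrative assumptions, not a standard API:

```python
# Sketch of the "dummy dimension record" technique: when a fact arrives
# before its dimension member exists, insert a placeholder row so the fact
# can reference a surrogate key; a later CDC pass fills in real attributes.
dim_product = {}   # natural key -> dimension row
next_sk = [1]      # simple mutable surrogate-key counter

def surrogate_for(natural_key):
    """Return an existing surrogate key, or create a dummy dimension record."""
    if natural_key not in dim_product:
        dim_product[natural_key] = {"sk": next_sk[0], "dummy": True}
        next_sk[0] += 1
    return dim_product[natural_key]["sk"]

def cdc_backfill(natural_key, attributes):
    """Later CDC pass supplies the real attributes (surrogate key stays stable)."""
    row = dim_product[natural_key]
    row.update(attributes)
    row["dummy"] = False
```

The extra bookkeeping (placeholders plus a second CDC pass) is exactly why streaming into this architecture is described as cumbersome.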
Inmon methodology¶
- Emphasis on creating an EDW to represent every entity (top-down)
- Normalized (3NF) model
- EDW consists of centralized normalized tables (off-limits to users)
- Users access dependent data-marts
- Cons
- PK-FK relationships => data loading is highly order-dependent
- Some implementations of a PK change require copying child rows, since start_date is part of the PK
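The order dependence from PK-FK relationships can be made concrete with a topological sort over an FK dependency map; the table names and dependencies here are illustrative assumptions:

```python
from graphlib import TopologicalSorter

# Illustrative FK dependency map for a 3NF EDW: each table lists the
# tables its foreign keys reference, so parents must load first.
fk_deps = {
    "order_line": {"orders", "product"},
    "orders": {"customer"},
    "customer": set(),
    "product": set(),
}

# static_order() yields predecessors before the tables that reference them
load_order = list(TopologicalSorter(fk_deps).static_order())
```

Any table added to the model extends this dependency graph, which is why load pipelines for normalized EDWs are sensitive to ordering.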
Data Mart¶
- Subject area specific
- Usually aggregated
- User's entry point to access data
- Can be a dependent or an independent data-mart
Access Layer¶
- Base Tables -> Core (1:1) views -> Access Control -> Semantic -> Users
- Access Control: Security, Privacy and Bypass
- A Relational Model focuses on capturing business rules
- A Dimensional Model focuses on evaluation, that is, monitoring business through metrics
- The relationships in a dimensional model represent navigational paths v/s business rules in relational model
- Has measures, such as amounts, counts, and durations, that are mathematical
- Has meters, that is, fact tables, that serve as buckets for measures
- Grain is a meter's lowest level of detail
- Has dimensions, which are the various ways to aggregate/filter measures
- Supports navigation, such as drill up/down and across (cross-meter measurements)
- Common red flags: normalized structures, fuzzy grain, subtypes, too abstract
- Most Semantic Models are Dimensional Models, but can be
- Relational or Dimensional
- Conceptual, Logical or Physical
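Drill up/down over a meter's grain can be sketched as aggregating a measure to coarser dimension attributes; the toy fact table, its grain, and the attribute names are illustrative:

```python
from collections import defaultdict

# Toy fact table at the grain "one row per order line": a measure (amount)
# plus dimension attributes at two levels of a geography hierarchy.
fact_sales = [
    {"region": "West", "city": "Denver", "amount": 10.0},
    {"region": "West", "city": "Denver", "amount": 5.0},
    {"region": "East", "city": "Boston", "amount": 7.5},
]

def roll_up(facts, dim_attr):
    """Drill up: aggregate the measure to a coarser dimension attribute."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[dim_attr]] += row["amount"]
    return dict(totals)

roll_up(fact_sales, "city")    # finer level of the hierarchy
roll_up(fact_sales, "region")  # drill up to the coarser level
```

Drilling down is the reverse: re-aggregating at a finer attribute, which is only possible down to the meter's grain.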