Data Warehouse¶
- If data-marts are used, the DW doesn't necessarily need a dimensional model, but it feeds one
- is immutable and history-preserving
- comparing Kimball v/s Inmon methodologies
- Slowly Changing Dimensions
- Type 1: Overwrite, Type 2: add new row, Type 3: add new attribute, Type 4: add history table, Type 6: Combination of types 1+2+3
- Types 2/4/6 advocate generating a new surrogate key for each revision
- Type 3 only captures one previous value (or as many as the number of history attributes added)
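The Type 2 mechanics above (expire the current row, insert a new row with a new surrogate key) can be sketched in Python; the row layout, column names (`sk`, `is_current`, etc.), and data are illustrative assumptions, not any specific warehouse's schema:

```python
from datetime import date

# Illustrative SCD Type 2 dimension: one row per revision, each with its
# own surrogate key (`sk`); `is_current` marks the active revision.
dim_customer = [
    {"sk": 1, "customer_id": "C1", "city": "Austin",
     "start_date": date(2020, 1, 1), "end_date": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_city, change_date, next_sk):
    """Type 2 change: expire the current row, then add a new row with a new surrogate key."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return dim  # no change detected, nothing to do
            row["end_date"] = change_date   # close out the old revision
            row["is_current"] = False
    dim.append({"sk": next_sk, "customer_id": customer_id, "city": new_city,
                "start_date": change_date, "end_date": None, "is_current": True})
    return dim

# Customer C1 moves: history is preserved as two rows, two surrogate keys
apply_scd2(dim_customer, "C1", "Denver", date(2021, 6, 1), next_sk=2)
```

A Type 1 change would instead overwrite `city` in place, losing history.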
- lambda architecture: uses both batch and stream architecture to provide two datasets with different qualities
- batch layer: high latency, high accuracy, comprehensive. e.g. Hadoop
- speed layer: low latency, lower accuracy. e.g. stream processing such as Apache Storm, Spark Streaming, etc.
- serving layer: stores data from both the batch and speed layers to satisfy user queries. e.g. Druid, Cassandra, HBase
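A minimal sketch of how a serving layer might answer queries by merging the two views, assuming precomputed batch counts plus incremental speed-layer counts since the last batch run (all names and data are illustrative):

```python
# Illustrative lambda-architecture serving layer: the batch view is
# authoritative up to its last run; the speed view covers the gap since then.
batch_view = {"page_a": 100, "page_b": 40}   # accurate, high latency
speed_view = {"page_a": 3, "page_c": 1}      # approximate, low latency

def serve(key):
    """Combine comprehensive batch counts with recent speed-layer counts."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

serve("page_a")  # batch count plus recent increments
```

When the next batch run completes, its output absorbs the speed-layer data and the speed view is reset.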
- Data Warehouse Types
| DW | Data type | Use |
|---|---|---|
| Traditional | normalized, dimensional | analytics and BI |
| Logical | federated, virtual layer | easy consolidation |
| Real-time | streaming data | real-time analytics |
| Context-independent | graph, non-relational | data science, data exploration, ad-hoc |
- Architecture Patterns
- Data Architecture: Logical Data Modeling
Kimball methodology¶
- DW and data-mart designs are driven by business processes and questions (bottom-up approach)
- Federated Dimensional Models
- EDW is logical collection of decentralized, independent data-marts
- Integration via conformed dimensions
- Cons
- highly dependent on business rules => business rule changes require dimensions to be reconfigured and reloaded
- frequent load failures and slow load times for large data due to:
- complexity in loading conformed dimensions
- load dependency
- dimensional overload
- real-time streaming is cumbersome due to a common technique that involves
- adding a dummy record for the dimension
- then performing CDC to add another row (to keep the EDW read-only)
- If the business needs cubes, they are often placed downstream, which consumes significant resources and requires constant syncing
- Emphasis is on dimensions; separating and protecting PII is often an afterthought, bolted on
- Difficult for data scientists, since they need raw data but are often given access to landing-zone data, which
- does not contain integrated business keys
- may not have deltas identified
- has no metadata or governance
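The dummy-record workaround noted under the streaming con above can be sketched as follows; `dim_product`, the `dummy` flag, and the CDC backfill function are illustrative assumptions, not a standard API:

```python
# Sketch of the "dummy dimension record" technique: when a fact arrives
# before its dimension member exists, insert a placeholder row so the fact
# can reference a surrogate key; a later CDC pass fills in real attributes.
dim_product = {}   # natural key -> dimension row
next_sk = [1]      # simple mutable surrogate-key counter

def surrogate_for(natural_key):
    """Return an existing surrogate key, or create a dummy dimension record."""
    if natural_key not in dim_product:
        dim_product[natural_key] = {"sk": next_sk[0], "dummy": True}
        next_sk[0] += 1
    return dim_product[natural_key]["sk"]

def cdc_backfill(natural_key, attributes):
    """Later CDC pass supplies the real attributes (surrogate key stays stable)."""
    row = dim_product[natural_key]
    row.update(attributes)
    row["dummy"] = False
```

The extra bookkeeping (placeholders plus a second CDC pass) is exactly why streaming into this architecture is described as cumbersome.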
Inmon methodology¶
- Emphasis on creating an EDW to represent every entity (top-down)
- Normalized (3NF) model
- EDW consists of centralized normalized tables (off-limits to users)
- Users access dependent data-marts
- Cons
- PK-FK relationships => data loading is highly order-dependent
- Some implementations of a PK change require copying child rows, since start_date is part of the PK
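The order dependence from PK-FK relationships can be made concrete with a topological sort over an FK dependency map; the table names and dependencies here are illustrative assumptions:

```python
from graphlib import TopologicalSorter

# Illustrative FK dependency map for a 3NF EDW: each table lists the
# tables its foreign keys reference, so parents must load first.
fk_deps = {
    "order_line": {"orders", "product"},
    "orders": {"customer"},
    "customer": set(),
    "product": set(),
}

# static_order() yields predecessors before the tables that reference them
load_order = list(TopologicalSorter(fk_deps).static_order())
```

Any table added to the model extends this dependency graph, which is why load pipelines for normalized EDWs are sensitive to ordering.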
Data Mart¶
- Subject area specific
- Usually aggregated
- User's entry point to access data
- Can be a dependent or an independent data-mart
Access Layer¶
- Base Tables -> Core (1:1) views -> Access Control -> Semantic -> Users
- Access Control: Security, Privacy and Bypass
- A Relational Model focuses on capturing business rules
- A Dimensional Model focuses on evaluation, that is, monitoring business through metrics
- The relationships in a dimensional model represent navigational paths v/s business rules in relational model
- Has measures, such as amounts, counts, and durations, that are mathematical
- Has meters, that is, fact tables, that serve as buckets for measures
- Grain is a meter's lowest level of detail
- Has dimensions, which are the various ways to aggregate/filter measures
- Supports navigation, such as drill up/down and across (cross-meter measurements)
- Common red flags: normalized structures, fuzzy grain, subtypes, too abstract
- Most Semantic Models are Dimensional Models, but can be
- Relational or Dimensional
- Conceptual, Logical or Physical
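Drill up/down over a meter's grain can be sketched as aggregating a measure to coarser dimension attributes; the toy fact table, its grain, and the attribute names are illustrative:

```python
from collections import defaultdict

# Toy fact table at the grain "one row per order line": a measure (amount)
# plus dimension attributes at two levels of a geography hierarchy.
fact_sales = [
    {"region": "West", "city": "Denver", "amount": 10.0},
    {"region": "West", "city": "Denver", "amount": 5.0},
    {"region": "East", "city": "Boston", "amount": 7.5},
]

def roll_up(facts, dim_attr):
    """Drill up: aggregate the measure to a coarser dimension attribute."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[dim_attr]] += row["amount"]
    return dict(totals)

roll_up(fact_sales, "city")    # finer level of the hierarchy
roll_up(fact_sales, "region")  # drill up to the coarser level
```

Drilling down is the reverse: re-aggregating at a finer attribute, which is only possible down to the meter's grain.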