Data Vault¶
- DV are designed to ease storing data than user queries
- Data Vaults are not directly accessible by users.
- Data Marts are mostly virtual that source data from Raw and Business DV
- Users access data through Business Access Layer
- V/S Integrated EDW
- Integrated model is too complex to query, too many joins
- Cost to load data is too high, inter-dependencies make for complex ETL
- DV advantages:
- incremental delivery, agile, flexibility, reduced complexity, high performance parallel loads,
- incremental modelling
- DV V1 vs V2 between DV1 and DV2
| Feature | Data Vault 1 | Data Vault2 |
|---|---|---|
| scope | modeling technique | methodology |
| Surrogate Key | - | HashKey |
| Links | relation, event or transaction | only business relations |
| Reference data | Cleaned | Raw |
| Analytics | Reporting, BI | ML, Data Science, exploration |
| Satellite (?) | End-date | Insert-only |
Goals¶
- Source: Teradata Case Study
- Agility
- Simplified, incremental EDW development
- Formulaic ETL/ELT patterns
- Generation of artifacts
- Scalability
- Enable parallel loading as much as possible, so that very large implementations can scale out without the need for major redesign
- Traceability
- Emphasizes the need to trace where all the data in the database came from
- Complete (Atomic)
- No distinction between good and bad data ("bad" => not conforming to business rules).
- a Data Vault stores "a single version of the facts" as opposed to storing "a single version of the truth"
- Flexibility - Resilience to Change
- Separation of the things that do not change from those that do.
- Explicitly separating structural information from descriptive attributes.
- Optimized for storage and sacrifices consumption efficiency to do so.
Model¶
- Hubs: Business Keys
- Represent a core business concept
- Source-system agnostic
- Immutable
- Stores the Surrogate Key and the Natural Key(s)
- PK is always a single value
- Always connected via Links
- no FK or parent/child hubs, instead use links
- table structure consists of: NK, Key (Hash in DV2), source, load_ts
- Links: Associations / Transactions
- Represent events (eg transactions) or relationship between 2 or more Hubs or links
- links are Many-to-Many, for flexibility, even for FK relationships that are 1-to-Many
- Only stores the linked Hubs’ (Surrogate) keys
- Immutable
- Are not related to a source system
- Satellites: Descriptors
- Contains the descriptive data (No Natural Keys stored)
- Can only join to a single Hub or Link
- Is subject to change. (History is preserved over time)
- Can be specifically related to a source system
- Hubs and Links can have many satellites
- Split based upon: information type, change rate
- table structure: Key, load_dts, non-key columns, source, Hash of all columns (DV2 only for quick CDC)
Data Vault types:¶
- Raw Data Vault:
- as it (contents) exists in source-system (except maybe format) and this makes it auditable
- no data cleaning nor conformation
- similar to acquisition layer in traditional DW
- Business Data Vault:
- contains data derived from raw data vault by applying business rules
- closer to Integrated Model
- conformed and cleansed data and business rules applied
- Offers Query Assistant Tables for ease of querying and performance
- Point-in-time (snapshot) tables for performance
- contains hub/link key, PIR snapshot date and corresponding satellite dates
- no satellite data is stored
- may contain natural-key for performance
- Bridge Tables
- a table containing keys from more than one hub and/or link
- no satellite data is stored
- Derived Satellite tables contain, pre-calculated (e.g. summary) data, exist only in Business DV and are attached to Hub or Link
- Point-in-time (snapshot) tables for performance
- also includes stand-alone reference data such as calendars, code tables can also be used
Architecture¶
- Landing zone: raw data from source, batch or real-time
- Staging zone: from landing zone
- Data Science: from landing zone, and EDW
- EDW, Raw DV: from staging, landing zone, real-time feeds and Data Science
- Information marts: two marts, one from EDW, and other from Data Science
- visualization layer: from the two marts above
- bi tooling, self service BI: from visualization layer
Methodology¶
- Data acquisition team
- larger or the only team initially until most sources have been captured
- after that they document external sources to help data scientists
- information provision team
- provide business reports from EDW
- soft business rules are decouple from EDW