Data Vault¶

DV are designed to ease storing data than user queries
Data Vaults are not directly accessible by users.
Data Marts are mostly virtual that source data from Raw and Business DV
Users access data through Business Access Layer
V/S Integrated EDW
Integrated model is too complex to query, too many joins
Cost to load data is too high, inter-dependencies make for complex ETL
DV advantages:
- incremental delivery, agile, flexibility, reduced complexity, high performance parallel loads,
- incremental modelling
DV V1 vs V2 between DV1 and DV2

Feature	Data Vault 1	Data Vault2
scope	modeling technique	methodology
Surrogate Key	-	HashKey
Links	relation, event or transaction	only business relations
Reference data	Cleaned	Raw
Analytics	Reporting, BI	ML, Data Science, exploration
Satellite (?)	End-date	Insert-only

Goals¶

Source: Teradata Case Study
Agility
Simplified, incremental EDW development
Formulaic ETL/ELT patterns
Generation of artifacts
Scalability
Enable parallel loading as much as possible, so that very large implementations can scale out without the need for major redesign
Traceability
Emphasizes the need to trace where all the data in the database came from
Complete (Atomic)
No distinction between good and bad data ("bad" => not conforming to business rules).
a Data Vault stores "a single version of the facts" as opposed to storing "a single version of the truth"
Flexibility - Resilience to Change
Separation of the things that do not change from those that do.
Explicitly separating structural information from descriptive attributes.
Optimized for storage and sacrifices consumption efficiency to do so.

Hubs: Business Keys
Represent a core business concept
Source-system agnostic
Immutable
Stores the Surrogate Key and the Natural Key(s)
PK is always a single value
Always connected via Links
no FK or parent/child hubs, instead use links
table structure consists of: NK, Key (Hash in DV2), source, load_ts
Links: Associations / Transactions
Represent events (eg transactions) or relationship between 2 or more Hubs or links
links are Many-to-Many, for flexibility, even for FK relationships that are 1-to-Many
Only stores the linked Hubs’ (Surrogate) keys
Immutable
Are not related to a source system
Satellites: Descriptors
Contains the descriptive data (No Natural Keys stored)
Can only join to a single Hub or Link
Is subject to change. (History is preserved over time)
Can be specifically related to a source system
Hubs and Links can have many satellites
Split based upon: information type, change rate
table structure: Key, load_dts, non-key columns, source, Hash of all columns (DV2 only for quick CDC)

Raw Data Vault:
as it (contents) exists in source-system (except maybe format) and this makes it auditable
no data cleaning nor conformation
similar to acquisition layer in traditional DW
Business Data Vault:
contains data derived from raw data vault by applying business rules
closer to Integrated Model
conformed and cleansed data and business rules applied
Offers Query Assistant Tables for ease of querying and performance
- Point-in-time (snapshot) tables for performance
- contains hub/link key, PIR snapshot date and corresponding satellite dates
- no satellite data is stored
- may contain natural-key for performance
- Bridge Tables
- a table containing keys from more than one hub and/or link
- no satellite data is stored
- Derived Satellite tables contain, pre-calculated (e.g. summary) data, exist only in Business DV and are attached to Hub or Link
also includes stand-alone reference data such as calendars, code tables can also be used