Skip to content

Data Vault

  • DV are designed to ease storing data than user queries
  • Data Vaults are not directly accessible by users.
  • Data Marts are mostly virtual that source data from Raw and Business DV
  • Users access data through Business Access Layer
  • V/S Integrated EDW
  • Integrated model is too complex to query, too many joins
  • Cost to load data is too high, inter-dependencies make for complex ETL
  • DV advantages:
    • incremental delivery, agile, flexibility, reduced complexity, high performance parallel loads,
    • incremental modelling
  • DV V1 vs V2 between DV1 and DV2
Feature Data Vault 1 Data Vault2
scope modeling technique methodology
Surrogate Key - HashKey
Links relation, event or transaction only business relations
Reference data Cleaned Raw
Analytics Reporting, BI ML, Data Science, exploration
Satellite (?) End-date Insert-only

Goals

  • Source: Teradata Case Study
  • Agility
  • Simplified, incremental EDW development
  • Formulaic ETL/ELT patterns
  • Generation of artifacts
  • Scalability
  • Enable parallel loading as much as possible, so that very large implementations can scale out without the need for major redesign
  • Traceability
  • Emphasizes the need to trace where all the data in the database came from
  • Complete (Atomic)
  • No distinction between good and bad data ("bad" => not conforming to business rules).
  • a Data Vault stores "a single version of the facts" as opposed to storing "a single version of the truth"
  • Flexibility - Resilience to Change
  • Separation of the things that do not change from those that do.
  • Explicitly separating structural information from descriptive attributes.
  • Optimized for storage and sacrifices consumption efficiency to do so.

Model

  • Hubs: Business Keys
  • Represent a core business concept
  • Source-system agnostic
  • Immutable
  • Stores the Surrogate Key and the Natural Key(s)
  • PK is always a single value
  • Always connected via Links
  • no FK or parent/child hubs, instead use links
  • table structure consists of: NK, Key (Hash in DV2), source, load_ts
  • Links: Associations / Transactions
  • Represent events (eg transactions) or relationship between 2 or more Hubs or links
  • links are Many-to-Many, for flexibility, even for FK relationships that are 1-to-Many
  • Only stores the linked Hubs’ (Surrogate) keys
  • Immutable
  • Are not related to a source system
  • Satellites: Descriptors
  • Contains the descriptive data (No Natural Keys stored)
  • Can only join to a single Hub or Link
  • Is subject to change. (History is preserved over time)
  • Can be specifically related to a source system
  • Hubs and Links can have many satellites
  • Split based upon: information type, change rate
  • table structure: Key, load_dts, non-key columns, source, Hash of all columns (DV2 only for quick CDC)

Data Vault types

  • Raw Data Vault:
  • as it (contents) exists in source-system (except maybe format) and this makes it auditable
  • no data cleaning nor conformation
  • similar to acquisition layer in traditional DW
  • Business Data Vault:
  • contains data derived from raw data vault by applying business rules
  • closer to Integrated Model
  • conformed and cleansed data and business rules applied
  • Offers Query Assistant Tables for ease of querying and performance
    • Point-in-time (snapshot) tables for performance
    • contains hub/link key, PIR snapshot date and corresponding satellite dates
    • no satellite data is stored
    • may contain natural-key for performance
    • Bridge Tables
    • a table containing keys from more than one hub and/or link
    • no satellite data is stored
    • Derived Satellite tables contain, pre-calculated (e.g. summary) data, exist only in Business DV and are attached to Hub or Link
  • also includes stand-alone reference data such as calendars, code tables can also be used

Architecture

  • Landing zone: raw data from source, batch or real-time
  • Staging zone: from landing zone
  • Data Science: from landing zone, and EDW
  • EDW, Raw DV: from staging, landing zone, real-time feeds and Data Science
  • Information marts: two marts, one from EDW, and other from Data Science
  • visualization layer: from the two marts above
  • bi tooling, self service BI: from visualization layer

Methodology

  • Data acquisition team
  • larger or the only team initially until most sources have been captured
  • after that they document external sources to help data scientists
  • information provision team
  • provide business reports from EDW
  • soft business rules are decouple from EDW