Skip to content

Data Vault

  • DV are designed to ease storing data than user queries
  • Data Vaults are not directly accessible by users.
    • Data Marts are mostly virtual that source data from Raw and Business DV
    • Users access data through Business Access Layer
  • V/S Integrated EDW
    • Integrated model is too complex to query, too many joins
    • Cost to load data is too high, inter-dependencies make for complex ETL
    • DV advantages:
      • incremental delivery, agile, flexibility, reduced complexity, high performance parallel loads,
      • incremental modelling
  • DV V1 vs V2 between DV1 and DV2
Feature Data Vault 1 Data Vault2
scope modeling technique methodology
Surrogate Key - HashKey
Links relation, event or transaction only business relations
Reference data Cleaned Raw
Analytics Reporting, BI ML, Data Science, exploration
Satellite (?) End-date Insert-only

Goals

  • Source: Teradata Case Study
  • Agility
    • Simplified, incremental EDW development
    • Formulaic ETL/ELT patterns
    • Generation of artifacts
  • Scalability
    • Enable parallel loading as much as possible, so that very large implementations can scale out without the need for major redesign
  • Traceability
    • Emphasizes the need to trace where all the data in the database came from
  • Complete (Atomic)
    • No distinction between good and bad data ("bad" => not conforming to business rules).
    • a Data Vault stores "a single version of the facts" as opposed to storing "a single version of the truth"
  • Flexibility - Resilience to Change
    • Separation of the things that do not change from those that do.
    • Explicitly separating structural information from descriptive attributes.
    • Optimized for storage and sacrifices consumption efficiency to do so.

Model

  • Hubs: Business Keys
    • Represent a core business concept
    • Source-system agnostic
    • Immutable
    • Stores the Surrogate Key and the Natural Key(s)
    • PK is always a single value
    • Always connected via Links
    • no FK or parent/child hubs, instead use links
    • table structure consists of: NK, Key (Hash in DV2), source, load_ts
  • Links: Associations / Transactions
    • Represent events (eg transactions) or relationship between 2 or more Hubs or links
    • links are Many-to-Many, for flexibility, even for FK relationships that are 1-to-Many
    • Only stores the linked Hubs’ (Surrogate) keys
    • Immutable
    • Are not related to a source system
  • Satellites: Descriptors
    • Contains the descriptive data (No Natural Keys stored)
    • Can only join to a single Hub or Link
    • Is subject to change. (History is preserved over time)
    • Can be specifically related to a source system
    • Hubs and Links can have many satellites
    • Split based upon: information type, change rate
    • table structure: Key, load_dts, non-key columns, source, Hash of all columns (DV2 only for quick CDC)

Data Vault types:

  • Raw Data Vault:
    • as it (contents) exists in source-system (except maybe format) and this makes it auditable
    • no data cleaning nor conformation
    • similar to acquisition layer in traditional DW
  • Business Data Vault:
    • contains data derived from raw data vault by applying business rules
    • closer to Integrated Model
    • conformed and cleansed data and business rules applied
    • Offers Query Assistant Tables for ease of querying and performance
      • Point-in-time (snapshot) tables for performance
        • contains hub/link key, PIR snapshot date and corresponding satellite dates
        • no satellite data is stored
        • may contain natural-key for performance
      • Bridge Tables
        • a table containing keys from more than one hub and/or link
        • no satellite data is stored
      • Derived Satellite tables contain, pre-calculated (e.g. summary) data, exist only in Business DV and are attached to Hub or Link
    • also includes stand-alone reference data such as calendars, code tables can also be used

Architecture

  • Landing zone: raw data from source, batch or real-time
  • Staging zone: from landing zone
  • Data Science: from landing zone, and EDW
  • EDW, Raw DV: from staging, landing zone, real-time feeds and Data Science
  • Information marts: two marts, one from EDW, and other from Data Science
  • visualization layer: from the two marts above
  • bi tooling, self service BI: from visualization layer

Methodology

  • Data acquisition team
    • larger or the only team initially until most sources have been captured
    • after that they document external sources to help data scientists
  • information provision team
    • provide business reports from EDW
  • soft business rules are decouple from EDW