Data Architecture

Traditional

  • Consists of:
    • Data Ingestion: ELT, CDC, Streaming
    • Data Transformation: ELT, Spark, Big Data tools in Data Lake
    • Normalization and Aggregation: Data Warehouses, Data Marts, Backups
    • Data Analytics: Cubes (for BI/BA), Presto (Data Science), File sharing
  • Brings data from sources to consumers:
    • Data Sources: source systems, Enterprise Applications, External/3rd party, Web/log, IoT
    • Consumers: BI, Analytics, Data Science
  • Drawbacks:
    • Requires data movement between platforms, e.g. from Spark to the DW
    • Multiple data copies: e.g. the Data Lake, DW and Data Science platforms can all hold the same data in different formats
      • Data discovery: with many platforms and multiple copies, it's hard to find the data you need
      • Data Governance: each platform/data-store needs its own data governance
      • Data sharing: data can't be shared easily when it doesn't reside on the same platform
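
The flow described above can be sketched as three small stages. This is only an illustrative sketch: the order rows, field names and the per-region mart are made-up assumptions, and plain Python lists stand in for the data lake, warehouse and data mart. It shows how each stage materializes its own copy of the same data in a different format, which is the drawback listed above.

```python
from collections import defaultdict

# Hypothetical source rows; names and values are invented for illustration.
source_orders = [
    {"order_id": 1, "region": "EU", "amount": "120.50"},
    {"order_id": 2, "region": "US", "amount": "80.00"},
    {"order_id": 3, "region": "EU", "amount": "19.50"},
]

def ingest(rows):
    """Ingestion: land raw rows in the data lake unchanged (copy 1)."""
    return [dict(r) for r in rows]

def transform(lake_rows):
    """Transformation: cast types for the warehouse table (copy 2)."""
    return [{**r, "amount": float(r["amount"])} for r in lake_rows]

def aggregate(dw_rows):
    """Aggregation: build a per-region data mart for BI (copy 3)."""
    totals = defaultdict(float)
    for r in dw_rows:
        totals[r["region"]] += r["amount"]
    return dict(totals)

data_lake = ingest(source_orders)    # raw strings
warehouse = transform(data_lake)     # typed rows
sales_mart = aggregate(warehouse)    # aggregated view
print(sales_mart)
```

The same three orders now exist in three places, so discovery, governance and sharing each have to be handled three times.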

ODS - Operational Data Store

  • Integrates corporate data from heterogeneous data sources for operational reporting
  • Supports operational reporting on current or near-real-time data
  • Structured similarly to the source systems, with some cleanup and transformation applied to ensure integrity
  • Unlike a DW, an ODS:
    • holds data at the lowest granularity
    • keeps data only for a short time
    • is constantly updated
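
A minimal sketch of the contrast above, under stated assumptions: the ODS is modelled as a dictionary that is overwritten in place and purged after a retention window, while the DW is modelled as an append-only history. The 30-day window, the field names and the idea that the DW keeps every version are assumptions made for the example, not something the notes specify.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)  # assumed ODS retention window

ods = {}         # order_id -> current row only (constantly updated)
dw_history = []  # every version ever seen (assumed DW behaviour)

def apply_change(event):
    """Apply one change event from a source system."""
    ods[event["order_id"]] = dict(event)  # overwrite: latest state wins
    dw_history.append(dict(event))        # append: history is preserved

def purge_ods(now):
    """Drop ODS rows that have not changed within the retention window."""
    expired = [k for k, row in ods.items() if now - row["updated_at"] > RETENTION]
    for k in expired:
        del ods[k]

t0 = datetime(2024, 1, 1)
apply_change({"order_id": 1, "status": "NEW", "updated_at": t0})
apply_change({"order_id": 1, "status": "SHIPPED", "updated_at": t0 + timedelta(days=1)})
apply_change({"order_id": 2, "status": "NEW", "updated_at": t0 + timedelta(days=40)})
purge_ods(now=t0 + timedelta(days=45))

print(sorted(ods))      # order ids still held in the ODS
print(len(dw_history))  # versions retained in the DW
```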

Data Lake

  • Stores structured and unstructured data.
Criterion      | Data Warehouse | Data Lake
Data Structure | Relational     | All: IoT, logs, relational
Schema         | On write       | On read
DQ             | Cleaned        | Raw
Analytics      | Reporting, BI  | ML, Data Science, exploration
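
The "Schema" row can be illustrated with a small sketch. The schema, the field names and the helper functions below are all hypothetical: the warehouse path validates every record before storing it (schema on write), while the lake path stores raw text and applies the schema only when the data is read (schema on read).

```python
import json

# Hypothetical schema for sensor readings.
SCHEMA = {"device_id": str, "temperature": float}

def write_to_warehouse(table, record):
    """Schema on write: validate first, reject records that don't fit."""
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"bad or missing field: {field}")
    table.append({f: record[f] for f in SCHEMA})

def write_to_lake(files, raw_line):
    """Schema on read: store the raw line as is, no checks."""
    files.append(raw_line)

def read_from_lake(files):
    """Apply the schema at read time; skip lines that cannot be parsed."""
    for line in files:
        try:
            rec = json.loads(line)
            yield {"device_id": str(rec["device_id"]),
                   "temperature": float(rec["temperature"])}
        except (ValueError, KeyError, TypeError):
            continue

warehouse = []
write_to_warehouse(warehouse, {"device_id": "a1", "temperature": 21.5})
try:
    write_to_warehouse(warehouse, {"device_id": "a2", "temperature": "hot"})
except ValueError as err:
    print("rejected at write time:", err)

lake = []
write_to_lake(lake, '{"device_id": "a1", "temperature": "21.5"}')
write_to_lake(lake, "not json at all")  # accepted: the lake stores raw data
rows = list(read_from_lake(lake))
print(rows)
```

The bad record never reaches the warehouse, whereas the lake accepts everything and leaves the cleanup cost to each reader.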

Risks/Limitations

  • Siloed or unused data due to lack of governance or poor performance
  • Complex data pipelines due to the wide variety of data types, governance requirements and integration styles (CDC etc.)
  • Governance and security: limiting access to PII data; maintaining metadata, data lineage and freshness
  • Snowflake solution:
    • Simpler data pipelines with SQL-based ELT and Snowpipe
    • Easier access to semi-structured data using SQL
    • Bring data into Snowflake, or keep the existing data lake and use external tables and materialized views for better access
    • Better security: automatic encryption, RBAC, HIPAA and FedRAMP certifications, data masking and external tokenization
    • Elastic scalability, with caching of the most frequently used data
    • Better performance: storage compression, zero-copy cloning, data sharing
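
To make the idea of role-based data masking concrete, here is a minimal sketch in plain Python. It is not Snowflake's API or syntax: the column names, role names and the `mask` and `read_row` helpers are all hypothetical, and the rule (show only the last four characters of a PII value unless the role is explicitly allowed) is an assumption chosen for the example.

```python
# Hypothetical configuration; not taken from any real platform.
PII_COLUMNS = {"email", "ssn"}
UNMASKED_ROLES = {"compliance_officer"}

def mask(value):
    """Hide all but the last four characters of a value."""
    s = str(value)
    return "*" * max(len(s) - 4, 0) + s[-4:]

def read_row(row, role):
    """Return the row as the given role is allowed to see it."""
    if role in UNMASKED_ROLES:
        return dict(row)
    return {col: mask(val) if col in PII_COLUMNS else val
            for col, val in row.items()}

row = {"name": "Ada", "email": "ada@example.com", "ssn": "123-45-6789"}
analyst_view = read_row(row, "analyst")
officer_view = read_row(row, "compliance_officer")
print(analyst_view)
```

Masking at read time means a single stored copy can serve both roles, rather than maintaining a separate scrubbed copy per audience.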

Data applications

Differences between Data Analytics, Data Science and Machine Learning (ML)

  • Data Analytics: Extract relevant information from a typically rather small dataset
  • Data Science: Conduct operations over various data sources to prove or disprove a certain hypothesis
  • Machine Learning: Develop software that learns by itself by extracting meaning from data

Analytical applications

  • Real-time fraud detection
  • Click-stream analysis
  • Smart grid
  • IoT devices
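
As one possible illustration of real-time fraud detection, the sketch below applies a simple velocity rule over a sliding time window: a card is flagged when it makes more than a set number of transactions within the window. The thresholds, the card ids and the sample stream are made-up assumptions, events are assumed to arrive in timestamp order, and real systems combine many more signals than this.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # assumed window length
MAX_TXNS = 3         # assumed limit per card within the window

recent = defaultdict(deque)  # card_id -> timestamps inside the window

def is_suspicious(card_id, ts):
    """Record a transaction and report whether the card exceeds the limit."""
    window = recent[card_id]
    window.append(ts)
    # Evict timestamps that have fallen out of the sliding window.
    while window and ts - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_TXNS

# Hypothetical stream of (card_id, timestamp in seconds) events.
stream = [("c1", 0), ("c1", 10), ("c2", 15), ("c1", 20), ("c1", 30), ("c1", 200)]
flags = [(card, ts) for card, ts in stream if is_suspicious(card, ts)]
print(flags)
```

Each event is processed once with a small per-card state, which is what makes this style of check usable on a stream rather than in a batch job.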