Data Architecture
Traditional
- Consists of:
    - Data Ingestion: ELT, CDC, Streaming
    - Data Transformation: ELT, Spark, Big Data tools in the Data Lake
    - Normalization and Aggregation: Data Warehouses, Data Marts, Backups
    - Data Analytics: Cubes (for BI/BA), Presto (Data Science), File sharing
- Brings data from sources to consumers:
    - Data Sources: source systems, enterprise applications, external/third-party data, web/logs, IoT
    - Consumers: BI, Analytics, Data Science
- Drawbacks:
    - Requires data movement, e.g. from Spark to the DW
    - Multiple data copies: e.g. the Data Lake, DW and Data Science platforms could all hold the same data in different formats
    - Data discovery: with many platforms and multiple copies, it is hard to find the data you need
    - Data governance: each platform or data store needs its own data governance
    - Data sharing: data cannot be shared easily unless it resides on the same platform
ODS - Operational Data Store
- Integrates corporate data from heterogeneous data sources for operational reporting
- Supports operational reporting on near real-time or current data
- Structured similarly to the source systems, with some cleanup and transformation applied to ensure integrity
- Unlike a DW:
    - an ODS holds data at the lowest granularity
    - ODS data is short-lived
    - ODS data is constantly updated
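
The contrast with a DW can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module, and the table names (`ods_orders`, `dw_order_history`), columns and the `apply_change` helper are hypothetical. The ODS table is overwritten in place so it always shows the current state, while the DW table keeps every change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- ODS: one row per order, overwritten as the source system changes.
    CREATE TABLE ods_orders (order_id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT);
    -- DW: append-only history, one row per change.
    CREATE TABLE dw_order_history (order_id INTEGER, status TEXT, updated_at TEXT);
""")


def apply_change(order_id, status, updated_at):
    """Apply one source-system change to both stores (hypothetical helper)."""
    # ODS keeps only the current state: INSERT OR REPLACE overwrites by primary key.
    conn.execute(
        "INSERT OR REPLACE INTO ods_orders VALUES (?, ?, ?)",
        (order_id, status, updated_at),
    )
    # DW keeps every version for historical analysis.
    conn.execute(
        "INSERT INTO dw_order_history VALUES (?, ?, ?)",
        (order_id, status, updated_at),
    )


apply_change(1, "placed", "2024-01-01T10:00")
apply_change(1, "shipped", "2024-01-01T12:00")
apply_change(2, "placed", "2024-01-01T12:05")

ods_rows = conn.execute(
    "SELECT order_id, status FROM ods_orders ORDER BY order_id"
).fetchall()
dw_count = conn.execute("SELECT COUNT(*) FROM dw_order_history").fetchone()[0]

print(ods_rows)  # current state only: one row per order
print(dw_count)  # number of recorded changes
```

A real ODS would also purge or archive old rows on a retention schedule, which this sketch leaves out.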
Data Lake
- Stores structured and unstructured data.

| Criterion | Data Warehouse | Data Lake |
|---|---|---|
| Data | Structured, relational | All types: IoT, logs, relational |
| Schema | On write | On read |
| Data quality | Cleaned | Raw |
| Analytics | Reporting, BI | ML, data science, exploration |
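
The "Schema" row of the table can be illustrated with a small sketch. It is not tied to any particular warehouse or lake product. `ORDER_SCHEMA`, `write_to_warehouse`, `write_to_lake` and `read_from_lake` are hypothetical names, and plain Python lists stand in for storage. The warehouse path rejects records that do not fit the schema at write time, while the lake path stores everything raw and interprets it only on read.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "amount": float}


def write_to_warehouse(table, record):
    """Schema on write: validate before storing and reject records that do not fit."""
    for field, expected in ORDER_SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"{field!r} must be {expected.__name__}")
    table.append({field: record[field] for field in ORDER_SCHEMA})


def write_to_lake(files, raw_line):
    """Schema on read: store the raw payload untouched."""
    files.append(raw_line)


def read_from_lake(files):
    """Apply the schema only when reading; skip lines that cannot be interpreted."""
    for line in files:
        try:
            doc = json.loads(line)
            yield {"order_id": int(doc["order_id"]), "amount": float(doc["amount"])}
        except (ValueError, KeyError, TypeError):
            continue


warehouse, lake = [], []
raw_events = [
    '{"order_id": 1, "amount": 9.5}',
    '{"order_id": "2", "amount": "3"}',  # wrong types for the warehouse schema
    "not json",
]

for line in raw_events:
    write_to_lake(lake, line)  # the lake accepts every payload
    try:
        write_to_warehouse(warehouse, json.loads(line))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass  # rejected at write time

lake_rows = list(read_from_lake(lake))
print(len(warehouse), len(lake), len(lake_rows))
```

The trade-off shown here is the usual one: the warehouse holds less data but every row is clean, whereas the lake holds everything and pushes the cleaning cost onto each reader.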
Risks/Limitations
- Siloed or unused data due to lack of governance or poor performance
- Complex data pipelines due to the wide variety of data types, governance requirements and integration styles (CDC etc.)
- Governance and security: limiting access to PII data; maintaining metadata, data lineage and freshness
- Snowflake solution:
    - Simpler data pipelines with SQL-based ELT and Snowpipe
    - Easier access to semi-structured data using SQL
    - Bring data into Snowflake, or keep the existing data lake and use external tables and materialized views for better access
    - Better security: automatic encryption, RBAC, HIPAA and FedRAMP certifications, data masking and external tokenization
    - Elastic scalability, with caching of the most-used data
    - Better performance: storage compression, zero-copy cloning, data sharing
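
To make the combination of RBAC and data masking concrete, here is a conceptual sketch in plain Python. It is not Snowflake's API: in Snowflake, masking is configured through SQL masking policies. The role names, column names and the `mask_value` and `read_row` helpers are all assumptions made for illustration.

```python
# Hypothetical roles and column names, for illustration only.
PII_COLUMNS = {"email", "ssn"}
ROLES_WITH_PII_ACCESS = {"compliance_officer"}


def mask_value(value):
    """Hide all but the last four characters of a value."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


def read_row(row, role):
    """Return the row as seen by `role`: PII is masked unless the role is authorized."""
    if role in ROLES_WITH_PII_ACCESS:
        return dict(row)
    return {
        col: mask_value(val) if col in PII_COLUMNS else val
        for col, val in row.items()
    }


row = {"customer_id": 42, "email": "ann@example.com", "ssn": "123-45-6789"}
analyst_view = read_row(row, "analyst")
officer_view = read_row(row, "compliance_officer")

print(analyst_view)
print(officer_view)
```

The point of the design is that masking is decided at read time from the caller's role, so a single stored copy of the data can serve both audiences.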
Data applications
Differences between Data Analytics, Data Science and Machine Learning:
- Data Analytics: Extract relevant information from a dataset that is usually fairly small
- Data Science: Conduct operations over various data sources to prove or disprove a given hypothesis
- Machine Learning: Develop software that learns on its own by extracting meaning from data
Analytical applications
- Real-time fraud detection
- Click-stream analysis
- Smart grid
- IoT devices
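
As a rough illustration of the first item, real-time fraud detection often starts with simple velocity rules over a sliding time window. The sketch below is one such rule under stated assumptions: the `VelocityRule` class, its thresholds and the sample stream are hypothetical, timestamps are seconds, and events are assumed to arrive in time order.

```python
from collections import defaultdict, deque


class VelocityRule:
    """Flag a card that makes more than `max_events` transactions within `window_seconds`."""

    def __init__(self, max_events=3, window_seconds=60):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # card_id -> timestamps inside the window

    def observe(self, card_id, timestamp):
        """Record one transaction; return True if the card now exceeds the limit."""
        window = self._recent[card_id]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and timestamp - window[0] >= self.window_seconds:
            window.popleft()
        return len(window) > self.max_events


rule = VelocityRule(max_events=3, window_seconds=60)
stream = [
    ("card-A", 0),
    ("card-A", 10),
    ("card-B", 15),
    ("card-A", 20),
    ("card-A", 30),   # fourth card-A transaction within 60 seconds
    ("card-A", 120),  # earlier card-A activity has expired by now
]
flags = [rule.observe(card, ts) for card, ts in stream]
print(flags)
```

The same per-key sliding-window pattern applies to click-stream sessions or IoT sensor readings, with the key and the threshold changed.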