Data Warehouse Design 🏛️📐
Building a Data Warehouse is like building a skyscraper; you need a blueprint (Schema). The way you arrange your tables determines how fast your reports will run.
Loading stats…
1. The Fact and Dimension Model
In a DW, we divide data into two types to make analysis easier:
- Fact Table: The center of the model. It contains the Numbers (Quantifiable data) and "Foreign Keys" that link to dimensions.
- Additive Facts: Numbers that can be summed up (e.g., Total Sales).
- Non-Additive Facts: Numbers that cannot be summed (e.g., Unit Price, Percentage).
- Granularity: The level of detail in the fact table (e.g., "One row per day" vs. "One row per individual sale").
- Dimension Table: The surrounding tables. They contain the Context (Descriptive data).
- Contextual Details: Attributes like Brand Name, Color, Customer City, and Date.
- Filtering & Grouping: Dimensions are used to filter facts (e.g., "Show me Sales where Product = 'Laptop'").
2. Star Schema
The simplest and most common design in data warehousing.
- Direct Connectivity: One central Fact table connected directly to multiple Dimension tables. There are no "Sub-dimensions."
- De-normalized Structure: Because dimensions are not broken down, the system contains redundant data (e.g., the city "Mumbai" might be repeated 1,000 times), but this makes it faster.
- Query Performance: Since the computer only has to perform a single "Join" between the Fact and any Dimension, reports generate almost instantly.
- Simplicity: It is extremely easy for business users and tools like Power BI to understand and navigate.
3. Snowflake Schema
A more complex, highly organized version of the Star schema.
- Normalized Dimensions: In this schema, dimension tables are broken down into smaller sub-tables (e.g., the 'Product' table links to a 'Supplier' table, which links to a 'City' table).
- Reduced Redundancy: By normalizing data, you ensure that a city's name is stored only once in the entire database, saving storage space.
- Data Integrity: Because data is structured in multiple levels, it is easier to maintain and update without creating inconsistencies.
- Complex Joins: The trade-off is speed. To get a simple report, the computer might have to join 5 or 6 tables, which can be slow for massive datasets.
4. Galaxy Schema (Fact Constellation)
For very large companies, a single Fact table isn't enough.
- Multiple Fact Tables: A Galaxy schema contains multiple Fact tables (e.g., "Sales Fact" and "Inventory Fact") that share the same Dimension tables.
- Shared Dimensions: Both Sales and Inventory might use the same "Time" and "Product" dimensions.
- Cross-Functional Analysis: This allows a manager to see how "Sales Trends" are affecting "Warehouse Inventory" levels in one single report.
- Complex Architecture: This is the most realistic schema used by global giants like Walmart or Amazon.
Loading comparison…
Summary
- Fact Tables store purely numerical data for analysis.
- Dimension Tables store descriptions of those numbers.
- Star Schema is the gold standard for Speed.
- Snowflake Schema is the gold standard for Data Efficiency.
Quiz Time! 🎯
Loading quiz…