Home > Topics > Data Mining and Business Intelligence > Design and Development of Data Warehouse

Data Warehouse Design 🏛️📐

Building a Data Warehouse is like building a skyscraper; you need a blueprint (Schema). The way you arrange your tables determines how fast your reports will run.


Loading stats…


1. The Fact and Dimension Model

In a DW, we divide data into two types to make analysis easier:

  • Fact Table: The center of the model. It contains the Numbers (Quantifiable data) and "Foreign Keys" that link to dimensions.
    • Additive Facts: Numbers that can be summed up (e.g., Total Sales).
    • Non-Additive Facts: Numbers that cannot be summed (e.g., Unit Price, Percentage).
    • Granularity: The level of detail in the fact table (e.g., "One row per day" vs. "One row per individual sale").
  • Dimension Table: The surrounding tables. They contain the Context (Descriptive data).
    • Contextual Details: Attributes like Brand Name, Color, Customer City, and Date.
    • Filtering & Grouping: Dimensions are used to filter facts (e.g., "Show me Sales where Product = 'Laptop'").

2. Star Schema

The simplest and most common design in data warehousing.

  • Direct Connectivity: One central Fact table connected directly to multiple Dimension tables. There are no "Sub-dimensions."
  • De-normalized Structure: Because dimensions are not broken down, the system contains redundant data (e.g., the city "Mumbai" might be repeated 1,000 times), but this makes it faster.
  • Query Performance: Since the computer only has to perform a single "Join" between the Fact and any Dimension, reports generate almost instantly.
  • Simplicity: It is extremely easy for business users and tools like Power BI to understand and navigate.

3. Snowflake Schema

A more complex, highly organized version of the Star schema.

  • Normalized Dimensions: In this schema, dimension tables are broken down into smaller sub-tables (e.g., the 'Product' table links to a 'Supplier' table, which links to a 'City' table).
  • Reduced Redundancy: By normalizing data, you ensure that a city's name is stored only once in the entire database, saving storage space.
  • Data Integrity: Because data is structured in multiple levels, it is easier to maintain and update without creating inconsistencies.
  • Complex Joins: The trade-off is speed. To get a simple report, the computer might have to join 5 or 6 tables, which can be slow for massive datasets.

4. Galaxy Schema (Fact Constellation)

For very large companies, a single Fact table isn't enough.

  • Multiple Fact Tables: A Galaxy schema contains multiple Fact tables (e.g., "Sales Fact" and "Inventory Fact") that share the same Dimension tables.
  • Shared Dimensions: Both Sales and Inventory might use the same "Time" and "Product" dimensions.
  • Cross-Functional Analysis: This allows a manager to see how "Sales Trends" are affecting "Warehouse Inventory" levels in one single report.
  • Complex Architecture: This is the most realistic schema used by global giants like Walmart or Amazon.

Loading comparison…


Summary

  • Fact Tables store purely numerical data for analysis.
  • Dimension Tables store descriptions of those numbers.
  • Star Schema is the gold standard for Speed.
  • Snowflake Schema is the gold standard for Data Efficiency.

Quiz Time! 🎯

Loading quiz…