The ETL Process 🏗️➡️📦

How does data get from thousands of different bank ATMs into one central Data Warehouse? It travels through the ETL (Extract, Transform, Load) process. This is the most critical technical stage of data warehousing.




1. Extract (Data Acquisition)

In this first step, the system identifies and captures the relevant data from various source systems.

  • Source Diversity: Pulling data from relational databases (SQL), legacy mainframes, cloud-based CRM systems (Salesforce), and even simple Excel or JSON files.
  • Data Validation during Extraction: Performing basic checks to ensure the source data is readable and hasn't been corrupted during the connection.
  • Change Data Capture (CDC): Using techniques such as database triggers, modification timestamps, or transaction-log readers to identify only the records that have been added or changed since the last run, reducing the load on the network.
  • Extraction Frequency: Setting up schedules (e.g., real-time streaming for stock prices vs. nightly batches for HR data).
  • Staging Area: Moving extracted data into a temporary "Staging" database where it can be cleaned without affecting the original source apps.
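A minimal CDC-style extraction can be sketched with a "last run" watermark: pull only the rows whose modification timestamp is newer than the previous run. The table, columns, and timestamps below are illustrative, and an in-memory SQLite database stands in for the source system.

```python
import sqlite3

def extract_changed_rows(conn, last_run_iso):
    """Pull only rows added or modified since the last ETL run (simple CDC)."""
    cur = conn.execute(
        "SELECT id, amount, modified_at FROM transactions WHERE modified_at > ?",
        (last_run_iso,),
    )
    return cur.fetchall()

# Demo: an in-memory stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, amount REAL, modified_at TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, 100.0, "2024-01-01T09:00:00"),
        (2, 250.0, "2024-01-02T14:30:00"),
        (3, 75.0, "2024-01-03T08:15:00"),
    ],
)

# Only rows 2 and 3 changed after the last run's watermark.
changed = extract_changed_rows(conn, "2024-01-01T23:59:59")
```

Because ISO 8601 timestamps sort lexicographically, a plain string comparison is enough here; production CDC usually reads the database's transaction log instead of querying timestamps.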

2. Transform (Data Quality & Logic)

This is where the cleaning and reshaping happens. It is typically the most complex part of ETL.

  • Data Scrubbing (Cleaning): Removing duplicate customers, fixing spelling errors, and handling "NULL" or missing values appropriately.
  • Format Standardization: Converting all dates to a single format (e.g., ISO 8601) and ensuring all currencies are converted to a "Base Currency" using live exchange rates.
  • Business Logic Application: Applying complex rules like "Calculate Tax based on Region" or "Identify if this is a High-Value Customer based on spend."
  • Data Aggregation: Summarizing 1,000 individual daily sales into a single "Total Daily Sales" row to save space and speed up top-level reports.
  • Referential Integrity Checks: Ensuring that a "Sale" record doesn't exist for a "Product" that doesn't exist in the dimension table.
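Several of these transformations can be combined in one pass over staged records: drop duplicates, standardize dates to ISO 8601, handle NULLs, and apply a business rule. The field names, date formats, and the "high-value" threshold below are illustrative assumptions, not a fixed standard.

```python
from datetime import datetime

def transform(rows):
    """Scrub staged customer rows: dedupe, standardize, and apply business logic."""
    seen, out = set(), []
    for row in rows:
        # Data scrubbing: normalize the key and drop duplicates / missing keys.
        email = (row.get("email") or "").strip().lower()
        if not email or email in seen:
            continue
        seen.add(email)

        # Format standardization: try a few known source formats -> ISO 8601.
        raw, signup = row.get("signup", ""), None
        for fmt in ("%d/%m/%Y", "%Y-%m-%d", "%m-%d-%Y"):
            try:
                signup = datetime.strptime(raw, fmt).date().isoformat()
                break
            except ValueError:
                continue

        # NULL handling plus a business rule (threshold is an assumption).
        spend = float(row.get("spend") or 0.0)
        out.append({"email": email, "signup": signup,
                    "spend": spend, "high_value": spend >= 1000})
    return out

rows = [
    {"email": "Ana@Bank.com ", "signup": "02/03/2024", "spend": "1500"},
    {"email": "ana@bank.com",  "signup": "2024-03-02", "spend": "1500"},  # duplicate
    {"email": "bob@bank.com",  "signup": "2024-01-15", "spend": None},
]
clean = transform(rows)
```

The duplicate row is dropped, mixed date formats collapse to ISO 8601, and the missing spend becomes 0.0 rather than crashing downstream reports.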

3. Load (Data Placement)

The final stage where the cleaned and transformed data is physically written into the Data Warehouse.

  • Full Load: Typically done once during the initial setup—moving the entire history of the company into the warehouse.
  • Incremental (Delta) Load: The ongoing process of adding only new data points. This is much faster and more efficient for daily updates.
  • Index Management: Temporarily turning off database indexes during the load to speed up the writing process, then rebuilding them once the load is complete.
  • Audit Logging: Creating a detailed record of how many rows were loaded, how many were rejected (due to errors), and how long the process took.
  • Post-Load Verification: Running final "Balance Checks" to ensure that the sum of sales in the Warehouse matches the sum of sales in the Source system.
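The load-stage ideas above can be sketched together: drop the index before a bulk insert, rebuild it afterwards, count loaded versus rejected rows for the audit log, and finish with a balance check. SQLite stands in for the warehouse, and the table and index names are hypothetical.

```python
import sqlite3
import time

def incremental_load(conn, rows):
    """Append new rows and return a simple audit record."""
    start = time.perf_counter()
    loaded, rejected = 0, 0
    # Index management: drop before the bulk write, rebuild after.
    conn.execute("DROP INDEX IF EXISTS idx_sales_date")
    for row in rows:
        try:
            conn.execute("INSERT INTO sales (sale_date, amount) VALUES (?, ?)", row)
            loaded += 1
        except sqlite3.Error:
            rejected += 1  # audit logging: count rows rejected by constraints
    conn.execute("CREATE INDEX idx_sales_date ON sales (sale_date)")
    conn.commit()
    return {"loaded": loaded, "rejected": rejected,
            "seconds": round(time.perf_counter() - start, 3)}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT NOT NULL, amount REAL NOT NULL)")
conn.execute("CREATE INDEX idx_sales_date ON sales (sale_date)")

# One good row, one bad row (NULL amount violates the constraint).
audit = incremental_load(conn, [("2024-06-01", 99.5), ("2024-06-01", None)])

# Post-load verification: a "balance check" against the warehouse total.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

In a real pipeline the rejected rows would be written to an error table with a reason code, and the balance check would compare the warehouse total against the source system's total.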



Summary

  • Extract: Pull the data from sources.
  • Transform: Clean and format the data (crucial for preventing "Garbage In, Garbage Out").
  • Load: Save the data into the analytical warehouse.
  • ETL is the "Pipeline" that keeps a Data Warehouse alive with fresh data.
