
Data Warehousing

A data warehouse is a centralized repository designed specifically for analytical queries and reporting. Unlike transactional databases that are optimized for fast reads and writes of individual records, data warehouses are optimized for scanning and aggregating large volumes of historical data. They are the backbone of business intelligence and data-driven decision making.


OLTP vs OLAP

The distinction between OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) is fundamental to understanding why data warehouses exist.

Characteristic      OLTP                          OLAP
------------------  ----------------------------  -------------------------------------
Purpose             Run the business              Analyze the business
Users               Application users, customers  Analysts, data scientists, executives
Query pattern       Read/write single rows        Scan and aggregate millions of rows
Data freshness      Real-time                     Periodic (hourly, daily)
Schema design       Normalized (3NF)              Denormalized (star/snowflake)
Data volume         Gigabytes to terabytes        Terabytes to petabytes
Query complexity    Simple, short-running         Complex, long-running
Concurrency         Thousands of users            Tens to hundreds of users
Example             Process a payment             Quarterly revenue by region
OLTP (Transactional):

┌──────────┐    SELECT * FROM orders WHERE id = 12345
│ App/User │──▶ INSERT INTO orders VALUES (...)
│          │    UPDATE orders SET status = 'shipped' WHERE id = 12345
└──────────┘    (Fast, single-row operations)

OLAP (Analytical):

┌──────────┐    SELECT region, SUM(revenue), COUNT(DISTINCT customer_id)
│ Analyst  │──▶ FROM fact_orders JOIN dim_customers JOIN dim_products
│          │    WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'
└──────────┘    GROUP BY region ORDER BY revenue DESC
                (Slow, scan millions of rows)
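The two access patterns can be contrasted with a minimal in-memory SQLite sketch. The table and rows are made up for illustration; a real OLTP system and a real warehouse would of course be separate databases:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, revenue REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (region, revenue, status) VALUES (?, ?, ?)",
    [("EMEA", 120.0, "new"), ("EMEA", 80.0, "new"), ("APAC", 200.0, "new")],
)

# OLTP-style: touch one row, looked up by primary key
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
row = conn.execute("SELECT status FROM orders WHERE id = 1").fetchone()

# OLAP-style: scan the whole table and aggregate
totals = conn.execute(
    "SELECT region, SUM(revenue) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(row, totals)  # ('shipped',) [('APAC', 200.0), ('EMEA', 200.0)]
```

At warehouse scale the second query scans millions of rows, which is why OLAP systems favor columnar storage and denormalized schemas.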

Dimensional Modeling

Dimensional modeling is the design methodology used to structure data in a warehouse for optimal query performance and usability. Developed by Ralph Kimball, it organizes data into two types of tables: facts and dimensions.

Fact Tables

Fact tables store the measurable events of your business — transactions, clicks, shipments, payments. Each row represents a single event.

Characteristics:

  • Contain numeric measures (revenue, quantity, duration)
  • Contain foreign keys to dimension tables
  • Typically the largest tables in the warehouse
  • Append-only (new events are added, not updated)

Dimension Tables

Dimension tables store the descriptive context for facts — who, what, where, when, why. They provide the labels and attributes used to filter, group, and aggregate facts.

Characteristics:

  • Contain descriptive attributes (name, category, address)
  • Relatively small compared to fact tables
  • Change slowly over time (see SCD patterns)
  • Provide the “GROUP BY” and “WHERE” columns for queries
┌──────────────────────────────────────────────────────────┐
│                    Dimensional Model                     │
│                                                          │
│  dim_customer       fact_orders          dim_product     │
│  ┌─────────────┐    ┌───────────────┐    ┌────────────┐  │
│  │customer_id  │◄───│customer_id(FK)│───▶│product_id  │  │
│  │name         │    │product_id(FK) │    │name        │  │
│  │email        │    │date_id(FK)    │    │category    │  │
│  │segment      │    │store_id(FK)   │    │brand       │  │
│  │city         │    ├───────────────┤    │price       │  │
│  │state        │    │quantity       │    └────────────┘  │
│  │country      │    │revenue        │                    │
│  └─────────────┘    │discount       │     dim_date       │
│                     │cost           │    ┌────────────┐  │
│  dim_store          └───┬───────┬───┘    │date_id     │  │
│  ┌─────────────┐        │       │        │date        │  │
│  │store_id     │◄───────┘       └───────▶│year        │  │
│  │store_name   │                         │quarter     │  │
│  │city         │                         │month       │  │
│  │region       │                         │day_of_week │  │
│  └─────────────┘                         │is_holiday  │  │
│                                          └────────────┘  │
└──────────────────────────────────────────────────────────┘

Star Schema

The star schema is the simplest and most common dimensional model. A central fact table is surrounded by denormalized dimension tables, forming a star shape.

Example: E-Commerce Star Schema

-- Dimension: Customers
CREATE TABLE dim_customers (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(200),
    email         VARCHAR(200),
    segment       VARCHAR(50),   -- 'Consumer', 'Corporate', 'Enterprise'
    city          VARCHAR(100),
    state         VARCHAR(100),
    country       VARCHAR(100),
    signup_date   DATE
);

-- Dimension: Products
CREATE TABLE dim_products (
    product_id   INTEGER PRIMARY KEY,
    product_name VARCHAR(300),
    category     VARCHAR(100),
    subcategory  VARCHAR(100),
    brand        VARCHAR(100),
    unit_price   DECIMAL(10,2)
);

-- Dimension: Date (pre-populated calendar table)
CREATE TABLE dim_date (
    date_id        INTEGER PRIMARY KEY,
    full_date      DATE,
    year           INTEGER,
    quarter        INTEGER,
    month          INTEGER,
    month_name     VARCHAR(20),
    week           INTEGER,
    day_of_week    INTEGER,
    day_name       VARCHAR(20),
    is_weekend     BOOLEAN,
    is_holiday     BOOLEAN,
    fiscal_year    INTEGER,
    fiscal_quarter INTEGER
);

-- Fact: Orders
CREATE TABLE fact_orders (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER REFERENCES dim_customers(customer_id),
    product_id    INTEGER REFERENCES dim_products(product_id),
    order_date_id INTEGER REFERENCES dim_date(date_id),
    ship_date_id  INTEGER REFERENCES dim_date(date_id),
    quantity      INTEGER,
    unit_price    DECIMAL(10,2),
    discount      DECIMAL(5,2),
    revenue       DECIMAL(12,2),
    cost          DECIMAL(12,2),
    profit        DECIMAL(12,2)
);

Advantages of Star Schema

Advantage      Why It Matters
-------------  ----------------------------------------------------------
Simple joins   Only one join between the fact table and each dimension
Fast queries   Denormalized dimensions avoid multi-table joins
Intuitive      Business users can understand the model easily
Tool-friendly  BI tools (Tableau, Looker) work natively with star schemas

Snowflake Schema

The snowflake schema normalizes dimension tables into sub-dimensions, reducing redundancy at the cost of more joins.

Star Schema:

  dim_product ──▶ fact_orders ◄── dim_customer
  (category is a column on dim_product;
   city/state/country are columns on dim_customer)

Snowflake Schema:

  dim_category ──▶ dim_product ──▶ fact_orders ◄── dim_customer
  dim_brand ─────┘
  (customer geography normalized: dim_city ──▶ dim_state ──▶ dim_country)
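The extra joins a snowflake introduces show up directly in queries. A small SQLite sketch with the product dimension snowflaked into a separate category table (names and rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked product dimension: category moved to its own table
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_products (product_id INTEGER PRIMARY KEY, product_name TEXT,
                           category_id INTEGER REFERENCES dim_category(category_id));
CREATE TABLE fact_orders  (order_id INTEGER PRIMARY KEY,
                           product_id INTEGER, revenue REAL);

INSERT INTO dim_category VALUES (1, 'Electronics');
INSERT INTO dim_products VALUES (10, 'Laptop', 1);
INSERT INTO fact_orders  VALUES (100, 10, 1200.0);
""")

# Grouping by category now takes two joins instead of one
rows = conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_orders f
    JOIN dim_products p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
""").fetchall()
print(rows)  # [('Electronics', 1200.0)]
```

In a star schema, `category` would simply be a column on `dim_products` and the second join would disappear.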

Star vs Snowflake Comparison

Aspect                 Star Schema             Snowflake Schema
---------------------  ----------------------  -------------------------
Dimension structure    Denormalized (flat)     Normalized (hierarchical)
Number of joins        Fewer                   More
Query performance      Faster (fewer joins)    Slower (more joins)
Storage efficiency     More redundancy         Less redundancy
Ease of understanding  Simpler                 More complex
ETL complexity         Simpler loads           More complex loads
Maintenance            Easier                  Harder

Data Lakes

A data lake is a centralized storage repository that holds vast amounts of data in its raw, native format — structured, semi-structured, and unstructured.

┌──────────────────────────────────────────────────────────────┐
│                          Data Lake                           │
│                                                              │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐  │
│  │   Raw Zone   │     │   Cleaned    │     │   Curated    │  │
│  │              │     │     Zone     │     │     Zone     │  │
│  │ - JSON logs  │     │              │     │              │  │
│  │ - CSV dumps  │     │ - Validated  │     │ - Star schema│  │
│  │ - Avro events│     │ - Deduped    │     │ - Aggregates │  │
│  │ - Images     │     │ - Typed      │     │ - ML features│  │
│  │ - Documents  │     │ - Parquet    │     │ - Parquet    │  │
│  └──────────────┘     └──────────────┘     └──────────────┘  │
│                                                              │
│  Storage: S3, GCS, ADLS       Format: Parquet, Delta, Iceberg│
└──────────────────────────────────────────────────────────────┘
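A common convention is to lay these zones out as key prefixes in object storage, partitioned by date. A stdlib-only sketch of one such layout; the zone names match the diagram, but the `dt=` partitioning convention and file names are illustrative assumptions, not a standard:

```python
from datetime import date

def lake_key(zone: str, dataset: str, day: date, filename: str) -> str:
    """Build an object-store key like raw/orders/dt=2024-06-15/part-0000.json.

    zone is one of 'raw', 'cleaned', 'curated'; the dt= date prefix keeps
    each day's data independently listable and scannable.
    """
    if zone not in {"raw", "cleaned", "curated"}:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{dataset}/dt={day.isoformat()}/{filename}"

key = lake_key("raw", "orders", date(2024, 6, 15), "part-0000.json")
print(key)  # raw/orders/dt=2024-06-15/part-0000.json
```

Date-based partitioning like this is what lets engines such as Spark or Athena prune entire prefixes instead of scanning the whole lake.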

Data Lake vs Data Warehouse

Aspect        Data Lake                                  Data Warehouse
------------  -----------------------------------------  --------------------------------
Data format   Raw, any format                            Structured, modeled
Schema        Schema-on-read                             Schema-on-write
Users         Data engineers, data scientists            Analysts, business users
Query engine  Spark, Presto, Athena                      Built-in SQL engine
Data types    Structured, semi-structured, unstructured  Primarily structured
Cost          Cheap storage (object storage)             More expensive compute + storage
Governance    Challenging                                Built-in

The Lakehouse

The lakehouse architecture combines the best of both data lakes and data warehouses: the cheap, scalable storage of a lake with the ACID transactions, schema enforcement, and query performance of a warehouse.

┌───────────────────────────────────────────────────────────┐
│                  Lakehouse Architecture                   │
│                                                           │
│  ┌──────────────────────────────────────────────────┐     │
│  │              Query / BI / ML Layer               │     │
│  │       (SQL queries, dashboards, notebooks)       │     │
│  └────────────────────┬─────────────────────────────┘     │
│                       │                                   │
│  ┌────────────────────▼─────────────────────────────┐     │
│  │         Table Format / Metadata Layer            │     │
│  │    (Delta Lake, Apache Iceberg, Apache Hudi)     │     │
│  │                                                  │     │
│  │  Features:                                       │     │
│  │  - ACID transactions                             │     │
│  │  - Schema enforcement and evolution              │     │
│  │  - Time travel (query historical versions)       │     │
│  │  - Partition management                          │     │
│  │  - Merge / upsert / delete support               │     │
│  └────────────────────┬─────────────────────────────┘     │
│                       │                                   │
│  ┌────────────────────▼─────────────────────────────┐     │
│  │              Object Storage Layer                │     │
│  │          (S3, GCS, ADLS — Parquet files)         │     │
│  └──────────────────────────────────────────────────┘     │
└───────────────────────────────────────────────────────────┘

Table Formats Compared

Feature              Delta Lake          Apache Iceberg               Apache Hudi
-------------------  ------------------  ---------------------------  -------------
Creator              Databricks          Netflix                      Uber
ACID transactions    Yes                 Yes                          Yes
Time travel          Yes                 Yes                          Yes
Schema evolution     Yes                 Yes                          Yes
Partition evolution  Limited             Yes (hidden partitioning)    Yes
Engine support       Spark, Flink        Spark, Flink, Trino, Presto  Spark, Flink
Ecosystem            Databricks-centric  Engine-agnostic              Spark-centric

Lakehouse Example with Delta Lake

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder \
    .appName("LakehouseExample") \
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Write data as a Delta table (an ACID transaction)
orders_df = spark.read.parquet("s3://raw/orders/2024/")
orders_df.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("order_date") \
    .save("s3://lakehouse/fact_orders/")

# Upsert (merge) new data into the existing Delta table
new_orders = spark.read.parquet("s3://staging/new_orders/")
delta_table = DeltaTable.forPath(spark, "s3://lakehouse/fact_orders/")
delta_table.alias("target").merge(
    new_orders.alias("source"),
    "target.order_id = source.order_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# Time travel — query a previous version
historical = spark.read \
    .format("delta") \
    .option("versionAsOf", 5) \
    .load("s3://lakehouse/fact_orders/")

# Or query as of a specific timestamp
yesterday = spark.read \
    .format("delta") \
    .option("timestampAsOf", "2024-06-15T00:00:00Z") \
    .load("s3://lakehouse/fact_orders/")

Cloud Data Warehouses

Warehouse       Cloud        Pricing Model               Key Strength
--------------  -----------  --------------------------  -------------------------------
Snowflake       Multi-cloud  Separate storage + compute  Elastic scaling, data sharing
BigQuery        GCP          Serverless, per-query       Zero infrastructure management
Redshift        AWS          Provisioned or serverless   Deep AWS integration
Databricks SQL  Multi-cloud  Compute-based               Unified lakehouse platform
Azure Synapse   Azure        Provisioned or serverless   Microsoft ecosystem integration

Choosing a Warehouse

Factor                    Recommendation
------------------------  -----------------------------------------------------------------
Cloud provider lock-in    Choose Snowflake or Databricks for multi-cloud portability
Existing cloud            Choose the native warehouse for deepest integration
Budget                    BigQuery serverless is cost-effective for unpredictable workloads
Data science integration  Databricks excels at combining SQL analytics with ML
Simplicity                BigQuery requires the least infrastructure management

Quick Reference: Modeling Checklist

Step                           Question
-----------------------------  ---------------------------------------------------------------------
Identify the business process  What events are you measuring? (orders, clicks, shipments)
Declare the grain              What does one row in the fact table represent?
Choose dimensions              What context do users need for filtering and grouping?
Define measures                What numeric values will be aggregated? (SUM, COUNT, AVG)
Design the date dimension      Does it include fiscal periods, holidays, and day-of-week attributes?
Handle slowly changing dims    Which SCD type fits each dimension attribute?
Partition the fact table       By date? By region? What is the query pattern?
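The date-dimension step in the checklist is usually done by generating the calendar table in code rather than loading it from source data. A stdlib-only sketch; holiday and fiscal-period logic is omitted because it is organization-specific, so `is_holiday` is simply left False here:

```python
from datetime import date, timedelta

def build_dim_date(start: date, end: date) -> list:
    """Generate one row per calendar day with typical dim_date attributes."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_id": int(day.strftime("%Y%m%d")),  # surrogate key, e.g. 20240615
            "full_date": day.isoformat(),
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day_of_week": day.isoweekday(),         # 1 = Monday .. 7 = Sunday
            "is_weekend": day.isoweekday() >= 6,
            "is_holiday": False,                     # plug in your holiday calendar
        })
        day += timedelta(days=1)
    return rows

rows = build_dim_date(date(2024, 6, 14), date(2024, 6, 16))
print(rows[1]["date_id"], rows[1]["quarter"], rows[1]["is_weekend"])
# 20240615 2 True  (2024-06-15 is a Saturday)
```

The integer `YYYYMMDD` key convention keeps fact-table foreign keys compact and makes date-range partition pruning easy to reason about.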

Next Steps