
Data Engineering

Data engineering is the discipline of designing, building, and maintaining the systems and infrastructure that collect, store, transform, and serve data at scale. While data scientists build models and analysts generate insights, data engineers build the pipelines and platforms that make all of that possible. Without reliable data infrastructure, even the most sophisticated machine learning model is useless — garbage in, garbage out.


The Data Lifecycle

Every piece of data moves through a lifecycle from creation to consumption. Understanding this lifecycle is the foundation of data engineering.

┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐
│  Generate  │───▶│   Ingest   │───▶│   Store    │───▶│ Transform  │───▶│   Serve    │
│            │    │            │    │            │    │            │    │            │
│Applications│    │ APIs, CDC, │    │ Data Lake, │    │Clean, join,│    │Dashboards, │
│ IoT, Logs, │    │  Streams,  │    │ Warehouse, │    │ aggregate, │    │ ML models, │
│ Databases  │    │Batch loads │    │ Lakehouse  │    │  enrich    │    │ APIs, Apps │
└────────────┘    └────────────┘    └────────────┘    └────────────┘    └────────────┘

1. Generation

Data is created by applications, users, devices, and systems. Sources include:

  • Transactional databases — orders, user accounts, payments
  • Application logs — web server access logs, error logs, audit trails
  • IoT sensors — temperature readings, GPS coordinates, machine telemetry
  • Third-party APIs — social media feeds, weather data, financial market data
  • User interactions — clicks, page views, search queries

2. Ingestion

Data must be moved from its source into a storage or processing system. The two primary modes are:

| Mode        | Description                                             | Latency                 | Use Case                              |
|-------------|---------------------------------------------------------|-------------------------|---------------------------------------|
| Batch       | Collect data over a period, then process it all at once | Minutes to hours        | Daily reports, historical analysis    |
| Streaming   | Process data continuously as it arrives                 | Milliseconds to seconds | Fraud detection, real-time dashboards |
| Micro-batch | Small, frequent batches that approximate streaming      | Seconds to minutes      | Near-real-time analytics              |
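The micro-batch mode in the table above can be sketched in a few lines of plain Python: a continuous event stream is grouped into small, fixed-size batches, and each batch is then processed as a unit. The event shape and batch size here are illustrative, not tied to any particular tool.

```python
from itertools import islice
from typing import Iterable, Iterator


def micro_batches(events: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Group a continuous event stream into small, fixed-size batches."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch  # each small batch is processed as a unit, like a tiny batch job


# Simulated stream of six events, processed in batches of up to 4
events = [{"id": i} for i in range(6)]
batches = list(micro_batches(events, batch_size=4))
print([len(b) for b in batches])  # → [4, 2]
```

Real micro-batch engines (e.g. Spark Structured Streaming) trigger on time windows rather than fixed counts, but the accumulate-then-process loop is the same idea.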

3. Storage

Where and how data is stored determines what kinds of processing and querying are possible.

4. Transformation

Raw data is cleaned, validated, joined, aggregated, and enriched to create datasets that are useful for analysis and decision-making.

5. Serving

Transformed data is made available to consumers — BI dashboards, machine learning models, applications, and APIs.


Batch vs Stream Processing

The choice between batch and stream processing is one of the most fundamental decisions in data engineering.

Batch Processing

Batch processing operates on bounded datasets — finite collections of data processed as a unit.

┌─────────────────────────────────────────────────────┐
│                  Batch Processing                   │
│                                                     │
│  Data accumulated ──▶ Processed as ──▶ Output       │
│  over a time window   a single job     ready        │
│                                                     │
│  Example: Process all of yesterday's orders at 2 AM │
└─────────────────────────────────────────────────────┘

Characteristics:

  • High throughput — processes large volumes efficiently
  • Higher latency — results are not available until the batch completes
  • Simpler error handling — reprocess the entire batch on failure
  • Well-suited for historical analysis and periodic reporting

Common tools: Apache Spark, Apache Hadoop MapReduce, dbt, AWS Glue
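As a toy illustration of the batch pattern — one pass over a complete, bounded dataset — the nightly job in the diagram above might total a full day's orders by customer. This pure-Python sketch assumes a simple list-of-dicts input; a real job would use Spark or SQL over far larger data.

```python
from collections import defaultdict


def daily_revenue_by_customer(orders: list[dict]) -> dict[str, float]:
    """Batch job: process one day's accumulated orders as a single unit."""
    totals: dict[str, float] = defaultdict(float)
    for order in orders:  # the whole bounded dataset is in hand before we start
        totals[order["customer"]] += order["amount"]
    return dict(totals)


# Yesterday's accumulated orders, processed all at once (e.g. at 2 AM)
orders = [
    {"customer": "alice", "amount": 30.0},
    {"customer": "bob", "amount": 15.0},
    {"customer": "alice", "amount": 5.0},
]
print(daily_revenue_by_customer(orders))  # → {'alice': 35.0, 'bob': 15.0}
```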

Stream Processing

Stream processing operates on unbounded datasets — data that arrives continuously with no defined end.

┌─────────────────────────────────────────────────────┐
│                 Stream Processing                   │
│                                                     │
│  Event 1 ──▶ Process ──▶ Output                     │
│  Event 2 ──▶ Process ──▶ Output                     │
│  Event 3 ──▶ Process ──▶ Output                     │
│    ...         ...         ...                      │
│                                                     │
│  Example: Flag suspicious transactions in real time │
└─────────────────────────────────────────────────────┘

Characteristics:

  • Low latency — results available in milliseconds to seconds
  • Lower throughput per event compared to batch
  • Complex error handling — must handle late and out-of-order events
  • Well-suited for real-time monitoring, alerting, and event-driven architectures

Common tools: Apache Kafka, Apache Flink, Apache Spark Streaming, AWS Kinesis
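In contrast to the batch job, a stream processor handles each event the moment it arrives. This is a minimal pure-Python sketch of the fraud-flagging example from the diagram, with a made-up amount threshold standing in for a real detection model; a production version would consume from Kafka and run in Flink or Spark Streaming.

```python
from typing import Iterable, Iterator


def flag_suspicious(transactions: Iterable[dict], threshold: float) -> Iterator[dict]:
    """Process each transaction as it arrives; emit alerts immediately."""
    for tx in transactions:  # unbounded in principle: the stream may never end
        if tx["amount"] > threshold:
            yield {"tx_id": tx["id"], "reason": f"amount {tx['amount']} > {threshold}"}


# A short simulated stream; alerts are produced per event, not per batch
stream = [
    {"id": 1, "amount": 25.0},
    {"id": 2, "amount": 9500.0},  # suspicious
    {"id": 3, "amount": 40.0},
]
alerts = list(flag_suspicious(stream, threshold=1000.0))
print([a["tx_id"] for a in alerts])  # → [2]
```

Because `flag_suspicious` is a generator, each alert is available as soon as its event is seen — the low-latency property the characteristics list describes.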

Lambda vs Kappa Architecture

Two prominent architectural patterns address the batch-stream duality:

| Architecture | Description                                         | Pros                                                 | Cons                                                  |
|--------------|-----------------------------------------------------|------------------------------------------------------|-------------------------------------------------------|
| Lambda       | Separate batch and speed layers; results are merged | Handles both use cases; batch corrects stream errors | Two codebases to maintain; complex merging logic      |
| Kappa        | Stream-only; reprocess by replaying the event log   | Single codebase; simpler architecture                | Reprocessing can be slow; not ideal for all workloads |
Lambda Architecture:

┌──────────┐     ┌─────────────┐
│  Source  │────▶│ Batch Layer │──────────┐
│          │     └─────────────┘          │     ┌──────────────┐
│          │                              ├────▶│ Serving Layer│
│          │     ┌─────────────┐          │     └──────────────┘
│          │────▶│ Speed Layer │──────────┘
└──────────┘     └─────────────┘

Kappa Architecture:

┌──────────┐     ┌──────────────┐     ┌──────────────┐
│  Source  │────▶│ Stream Layer │────▶│ Serving Layer│
└──────────┘     └──────────────┘     └──────────────┘
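The Lambda architecture's "complex merging logic" can be sketched in miniature: the serving layer answers queries from the complete-but-stale batch view, overlaid with the fresh-but-partial speed view. This is one common simplification (newest value wins); real systems may instead combine the two views arithmetically, and all names here are illustrative.

```python
def serve(key: str, batch_view: dict, speed_view: dict):
    """Serving layer: recent stream results take precedence over batch results."""
    if key in speed_view:  # fresh value computed by the speed layer
        return speed_view[key]
    return batch_view.get(key)  # otherwise fall back to the last batch run


batch_view = {"clicks:2024-01-01": 1200, "clicks:2024-01-02": 900}
speed_view = {"clicks:2024-01-02": 945}  # the stream has newer data for Jan 2

print(serve("clicks:2024-01-01", batch_view, speed_view))  # → 1200
print(serve("clicks:2024-01-02", batch_view, speed_view))  # → 945
```

Kappa avoids exactly this merge step: there is only one view, and corrections are made by replaying the event log through the same stream code.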

The Role of a Data Engineer

A data engineer is responsible for building and maintaining the infrastructure and frameworks that enable data generation, collection, storage, and analysis. The role sits at the intersection of software engineering, database administration, and data science.

Core Responsibilities

| Responsibility            | Description                                                                     |
|---------------------------|---------------------------------------------------------------------------------|
| Pipeline Development      | Design and build ETL/ELT pipelines that move data from source to destination    |
| Data Modeling             | Design schemas and data models for warehouses and lakes                         |
| Infrastructure Management | Set up and maintain data platforms (Spark clusters, Kafka brokers, warehouses)  |
| Data Quality              | Implement validation, monitoring, and alerting for data pipelines               |
| Performance Optimization  | Tune queries, partitioning strategies, and storage formats                      |
| Collaboration             | Work with analysts, data scientists, and product teams to understand data needs |

How the role compares with its neighbors:

| Aspect | Data Engineer                   | Data Analyst              | Data Scientist         | ML Engineer           |
|--------|---------------------------------|---------------------------|------------------------|-----------------------|
| Focus  | Infrastructure and pipelines    | Insights and reporting    | Models and experiments | Model deployment      |
| Tools  | Spark, Airflow, SQL, Python     | SQL, Excel, Tableau       | Python, R, Jupyter     | TensorFlow, MLflow    |
| Output | Reliable data platforms         | Dashboards and reports    | Predictive models      | Production ML systems |
| Skills | Software engineering, databases | Statistics, visualization | Math, ML algorithms    | DevOps, ML frameworks |

The Modern Data Stack

The modern data stack refers to a collection of cloud-native tools that together form a complete data platform. Each layer handles a specific concern.

┌──────────────────────────────────────────────────────────────┐
│                    The Modern Data Stack                     │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │  Ingest  │  │  Store   │  │Transform │  │  Serve   │      │
│  │          │  │          │  │          │  │          │      │
│  │ Fivetran │  │Snowflake │  │   dbt    │  │  Looker  │      │
│  │ Airbyte  │  │BigQuery  │  │  Spark   │  │ Metabase │      │
│  │ Stitch   │  │Redshift  │  │          │  │ Superset │      │
│  │ Debezium │  │Databricks│  │          │  │   Hex    │      │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘      │
│                                                              │
│  ┌──────────────────────┐  ┌────────────────────────────┐    │
│  │    Orchestration     │  │       Observability        │    │
│  │  Airflow, Dagster,   │  │  Monte Carlo, Great        │    │
│  │  Prefect, Mage       │  │  Expectations, dbt tests   │    │
│  └──────────────────────┘  └────────────────────────────┘    │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Key Components

| Layer          | Purpose                                         | Popular Tools                               |
|----------------|-------------------------------------------------|---------------------------------------------|
| Ingestion      | Extract data from sources and load into storage | Fivetran, Airbyte, Stitch, Debezium         |
| Storage        | Centralized data repository                     | Snowflake, BigQuery, Redshift, Databricks   |
| Transformation | Clean, model, and enrich data                   | dbt, Apache Spark, Dataform                 |
| Orchestration  | Schedule and coordinate pipeline tasks          | Apache Airflow, Dagster, Prefect            |
| Serving/BI     | Visualize and expose data to stakeholders       | Looker, Metabase, Superset, Tableau         |
| Observability  | Monitor data quality and pipeline health        | Monte Carlo, Great Expectations, Elementary |
| Catalog        | Discover and document data assets               | DataHub, Amundsen, Atlan                    |
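To give the observability layer some flavor, here is a hand-rolled validation pass — a toy stand-in for what tools like Great Expectations formalize as declarative expectation suites. The null and range rules are hypothetical examples, not a real tool's API.

```python
def validate(rows: list[dict]) -> list[str]:
    """Toy data-quality check: report null keys and out-of-range amounts per row."""
    failures = []
    for i, row in enumerate(rows):
        if row.get("id") is None:
            failures.append(f"row {i}: id is null")
        if not (row.get("amount") or 0) > 0:
            failures.append(f"row {i}: amount must be positive")
    return failures


rows = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.0},
    {"id": 3, "amount": -2.0},
]
print(validate(rows))  # → ['row 1: id is null', 'row 2: amount must be positive']
```

The same checks wired into an orchestrator become monitoring: run them after each pipeline step and alert (or halt the pipeline) when the failure list is non-empty.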

Data Formats and Storage

Choosing the right data format significantly impacts performance, storage cost, and query speed.

| Format     | Type                 | Schema   | Compression | Best For              |
|------------|----------------------|----------|-------------|-----------------------|
| CSV        | Row-based text       | None     | Poor        | Simple data exchange  |
| JSON       | Semi-structured text | Flexible | Moderate    | APIs, nested data     |
| Avro       | Row-based binary     | Embedded | Good        | Streaming, Kafka      |
| Parquet    | Columnar binary      | Embedded | Excellent   | Analytics, warehouses |
| ORC        | Columnar binary      | Embedded | Excellent   | Hive ecosystem        |
| Delta Lake | Columnar + ACID      | Embedded | Excellent   | Lakehouses            |
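The row-based vs columnar distinction in the table can be made concrete with a toy in-memory model: in a columnar layout, an analytic query touches only the one column it needs, which is the intuition behind why formats like Parquet and ORC scan and compress so well for analytics.

```python
# Row-based layout: each record stored together (good for whole-record reads,
# like fetching one order by id)
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 20.0},
    {"id": 3, "country": "DE", "amount": 5.0},
]

# Columnar layout: each column stored together (good for analytic scans)
columns = {
    "id": [1, 2, 3],
    "country": ["DE", "US", "DE"],
    "amount": [10.0, 20.0, 5.0],
}

# SUM(amount): the columnar scan reads one contiguous list and never touches
# id or country; the row scan must visit every record
total_row = sum(r["amount"] for r in rows)
total_col = sum(columns["amount"])
print(total_row, total_col)  # → 35.0 35.0
```

On disk the effect is amplified: a columnar file skips entire column chunks, and values of one type stored together compress far better than mixed rows.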

A Simple Pipeline Example

Here is a minimal pipeline that extracts data from an API, transforms it, and loads it into a database:

import requests
import pandas as pd
from sqlalchemy import create_engine


def extract(api_url: str) -> list[dict]:
    """Extract data from a REST API."""
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()
    return response.json()


def transform(raw_data: list[dict]) -> pd.DataFrame:
    """Clean and transform raw data."""
    df = pd.DataFrame(raw_data)
    # Remove duplicates
    df = df.drop_duplicates(subset=["id"])
    # Standardize column names
    df.columns = [col.lower().replace(" ", "_") for col in df.columns]
    # Parse dates
    df["created_at"] = pd.to_datetime(df["created_at"])
    # Filter out invalid records
    df = df[df["amount"] > 0]
    return df


def load(df: pd.DataFrame, table_name: str, conn_string: str) -> None:
    """Load transformed data into a database."""
    engine = create_engine(conn_string)
    df.to_sql(table_name, engine, if_exists="append", index=False)
    print(f"Loaded {len(df)} rows into {table_name}")


# Run the pipeline
if __name__ == "__main__":
    raw = extract("https://api.example.com/orders")
    clean = transform(raw)
    load(clean, "orders", "postgresql://user:pass@localhost/analytics")

Topics in This Section

ETL Pipelines

Learn about ETL vs ELT patterns, pipeline design, orchestration with Apache Airflow, transformations with dbt, and scheduling strategies.


Data Warehousing

Understand star and snowflake schemas, dimensional modeling, OLAP vs OLTP, data lakes, and the lakehouse paradigm.


Stream Processing

Dive into Apache Spark Streaming, Flink, windowing strategies, exactly-once semantics, and watermarks.


Data Quality

Master data validation, Great Expectations, data contracts, lineage, governance, and observability.



Further Reading

  • Fundamentals of Data Engineering by Joe Reis and Matt Housley
  • Designing Data-Intensive Applications by Martin Kleppmann
  • The Data Warehouse Toolkit by Ralph Kimball
  • Data Engineering with Python by Paul Crickard