ETL Pipelines
Learn about ETL vs ELT patterns, pipeline design, orchestration with Apache Airflow, transformations with dbt, and scheduling strategies.
Data engineering is the discipline of designing, building, and maintaining the systems and infrastructure that collect, store, transform, and serve data at scale. While data scientists build models and analysts generate insights, data engineers build the pipelines and platforms that make all of that possible. Without reliable data infrastructure, even the most sophisticated machine learning model is useless: garbage in, garbage out.
Every piece of data moves through a lifecycle from creation to consumption. Understanding this lifecycle is the foundation of data engineering.
```
┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Generate   │──▶│    Ingest    │──▶│    Store     │──▶│  Transform   │──▶│    Serve     │
│              │   │              │   │              │   │              │   │              │
│ Applications,│   │ APIs, CDC,   │   │ Data Lake,   │   │ Clean, join, │   │ Dashboards,  │
│ IoT, Logs,   │   │ Streams,     │   │ Warehouse,   │   │ aggregate,   │   │ ML models,   │
│ Databases    │   │ Batch loads  │   │ Lakehouse    │   │ enrich       │   │ APIs, Apps   │
└──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘
```

Data is created by applications, users, devices, and systems. Common sources include operational databases, application logs, and IoT devices.
Data must be moved from its source into a storage or processing system. The two primary modes are:
| Mode | Description | Latency | Use Case |
|---|---|---|---|
| Batch | Collect data over a period, then process it all at once | Minutes to hours | Daily reports, historical analysis |
| Streaming | Process data continuously as it arrives | Milliseconds to seconds | Fraud detection, real-time dashboards |
| Micro-batch | Small, frequent batches that approximate streaming | Seconds to minutes | Near-real-time analytics |
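The three modes differ mainly in how long events are buffered before they are processed. As a rough illustration of the micro-batch idea (a pure-Python sketch, not tied to any particular framework), an ingester can buffer incoming events and flush them in small, frequent groups:

```python
from typing import Callable, Iterable

def micro_batch(events: Iterable, process: Callable[[list], None], batch_size: int = 3):
    """Buffer incoming events and flush them in small, frequent batches."""
    buffer = []
    for event in events:
        buffer.append(event)
        if len(buffer) >= batch_size:
            process(buffer)   # flush one small batch downstream
            buffer = []
    if buffer:                # flush any leftover events at the end
        process(buffer)

# Toy usage: each flush would normally write to storage or a warehouse
flushed = []
micro_batch(range(7), flushed.append, batch_size=3)
print(flushed)  # [[0, 1, 2], [3, 4, 5], [6]]
```

With `batch_size=1` this degenerates to per-event (streaming-like) processing; with a very large buffer it behaves like a batch job, which is why micro-batching sits between the two in latency.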
Where and how data is stored determines what kinds of processing and querying are possible.
Raw data is cleaned, validated, joined, aggregated, and enriched to create datasets that are useful for analysis and decision-making.
Transformed data is made available to consumers: BI dashboards, machine learning models, applications, and APIs.
The choice between batch and stream processing is one of the most fundamental decisions in data engineering.
Batch processing operates on bounded datasets: finite collections of data processed as a unit.
```
┌─────────────────────────────────────────────────────────┐
│                    Batch Processing                     │
│                                                         │
│  Data accumulated  ──▶  Processed as  ──▶  Output       │
│  over a time window     a single job      ready         │
│                                                         │
│  Example: Process all of yesterday's orders at 2 AM     │
└─────────────────────────────────────────────────────────┘
```

Characteristics:

- Operates on a bounded, complete dataset
- High throughput, but results are only available after the job finishes
- Jobs are straightforward to test, re-run, and backfill
Common tools: Apache Spark, Apache Hadoop MapReduce, dbt, AWS Glue
Stream processing operates on unbounded datasets: data that arrives continuously with no defined end.
```
┌─────────────────────────────────────────────────────────┐
│                    Stream Processing                    │
│                                                         │
│  Event 1 ──▶ Process ──▶ Output                         │
│  Event 2 ──▶ Process ──▶ Output                         │
│  Event 3 ──▶ Process ──▶ Output                         │
│    ...         ...         ...                          │
│                                                         │
│  Example: Flag suspicious transactions in real time     │
└─────────────────────────────────────────────────────────┘
```

Characteristics:

- Operates on an unbounded dataset, one event (or small window) at a time
- Low latency: results within milliseconds to seconds of an event arriving
- Must cope with late, duplicate, and out-of-order events
Common tools: Apache Kafka, Apache Flink, Apache Spark Streaming, AWS Kinesis
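To make the event-at-a-time model concrete, here is a minimal sketch in plain Python (no streaming framework; the threshold and field names are illustrative) that flags suspicious transactions as each event arrives, rather than waiting for a nightly batch:

```python
from typing import Iterator

def flag_suspicious(transactions: Iterator[dict], threshold: float = 1000.0):
    """Consume an unbounded stream of transactions one event at a time,
    emitting an alert as soon as a transaction exceeds the threshold."""
    for txn in transactions:
        if txn["amount"] > threshold:
            yield {"id": txn["id"], "reason": f"amount {txn['amount']} exceeds {threshold}"}

# Toy usage: in production the source would be a Kafka topic or Kinesis shard
events = [
    {"id": 1, "amount": 25.0},
    {"id": 2, "amount": 4999.0},   # suspicious
    {"id": 3, "amount": 12.5},
]
alerts = list(flag_suspicious(iter(events)))
print(alerts)  # one alert, for transaction id 2
```

The generator shape matters: alerts are produced as soon as the offending event is seen, which is the property batch jobs cannot offer.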
Two prominent architectural patterns address the batch-stream duality:
| Architecture | Description | Pros | Cons |
|---|---|---|---|
| Lambda | Separate batch and speed layers; results are merged | Handles both use cases; batch corrects stream errors | Two codebases to maintain; complex merging logic |
| Kappa | Stream-only; reprocess by replaying the event log | Single codebase; simpler architecture | Reprocessing can be slow; not ideal for all workloads |
Lambda Architecture:

```
┌────────┐     ┌─────────────┐
│ Source │──┬─▶│ Batch Layer │──┐
└────────┘  │  └─────────────┘  │   ┌───────────────┐
            │                   ├──▶│ Serving Layer │
            │  ┌─────────────┐  │   └───────────────┘
            └─▶│ Speed Layer │──┘
               └─────────────┘
```
Kappa Architecture:

```
┌────────┐   ┌──────────────┐   ┌───────────────┐
│ Source │──▶│ Stream Layer │──▶│ Serving Layer │
└────────┘   └──────────────┘   └───────────────┘
```

A data engineer is responsible for building and maintaining the infrastructure and frameworks that enable data generation, collection, storage, and analysis. The role sits at the intersection of software engineering, database administration, and data science.
| Responsibility | Description |
|---|---|
| Pipeline Development | Design and build ETL/ELT pipelines that move data from source to destination |
| Data Modeling | Design schemas and data models for warehouses and lakes |
| Infrastructure Management | Set up and maintain data platforms (Spark clusters, Kafka brokers, warehouses) |
| Data Quality | Implement validation, monitoring, and alerting for data pipelines |
| Performance Optimization | Tune queries, partitioning strategies, and storage formats |
| Collaboration | Work with analysts, data scientists, and product teams to understand data needs |
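For the Data Quality responsibility above, a common first step is a set of lightweight validation checks that run after every pipeline load. A minimal sketch with pandas (the rules and column names here are illustrative, not a real framework like Great Expectations):

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Return human-readable data-quality failures (empty list = all checks pass)."""
    failures = []
    if df.empty:
        failures.append("table is empty")
    if df["id"].duplicated().any():
        failures.append("duplicate primary keys in 'id'")
    if df["amount"].lt(0).any():
        failures.append("negative values in 'amount'")
    if df["created_at"].isna().any():
        failures.append("missing timestamps in 'created_at'")
    return failures

# Toy usage: this frame violates three of the four rules
df = pd.DataFrame({
    "id": [1, 2, 2],
    "amount": [10.0, -5.0, 3.0],
    "created_at": pd.to_datetime(["2024-01-01", None, "2024-01-02"]),
})
print(validate(df))  # flags duplicates, a negative amount, and a missing timestamp
```

In practice the returned failures would feed an alerting channel, and a non-empty list would block downstream tasks from running.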
| Aspect | Data Engineer | Data Analyst | Data Scientist | ML Engineer |
|---|---|---|---|---|
| Focus | Infrastructure and pipelines | Insights and reporting | Models and experiments | Model deployment |
| Tools | Spark, Airflow, SQL, Python | SQL, Excel, Tableau | Python, R, Jupyter | TensorFlow, MLflow |
| Output | Reliable data platforms | Dashboards and reports | Predictive models | Production ML systems |
| Skills | Software engineering, databases | Statistics, visualization | Math, ML algorithms | DevOps, ML frameworks |
The modern data stack refers to a collection of cloud-native tools that together form a complete data platform. Each layer handles a specific concern.
```
┌──────────────────────────────────────────────────────────────┐
│                    The Modern Data Stack                     │
├──────────────────────────────────────────────────────────────┤
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │  Ingest  │  │  Store   │  │Transform │  │  Serve   │     │
│  │          │  │          │  │          │  │          │     │
│  │ Fivetran │  │Snowflake │  │   dbt    │  │  Looker  │     │
│  │ Airbyte  │  │ BigQuery │  │  Spark   │  │ Metabase │     │
│  │ Stitch   │  │ Redshift │  │          │  │ Superset │     │
│  │ Debezium │  │Databricks│  │          │  │   Hex    │     │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘     │
│                                                              │
│  ┌──────────────────────┐  ┌────────────────────────────┐   │
│  │    Orchestration     │  │       Observability        │   │
│  │  Airflow, Dagster,   │  │  Monte Carlo, Great        │   │
│  │  Prefect, Mage       │  │  Expectations, dbt tests   │   │
│  └──────────────────────┘  └────────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
```

| Layer | Purpose | Popular Tools |
|---|---|---|
| Ingestion | Extract data from sources and load into storage | Fivetran, Airbyte, Stitch, Debezium |
| Storage | Centralized data repository | Snowflake, BigQuery, Redshift, Databricks |
| Transformation | Clean, model, and enrich data | dbt, Apache Spark, Dataform |
| Orchestration | Schedule and coordinate pipeline tasks | Apache Airflow, Dagster, Prefect |
| Serving/BI | Visualize and expose data to stakeholders | Looker, Metabase, Superset, Tableau |
| Observability | Monitor data quality and pipeline health | Monte Carlo, Great Expectations, Elementary |
| Catalog | Discover and document data assets | DataHub, Amundsen, Atlan |
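Orchestrators such as Airflow, Dagster, and Prefect all share one core idea: a pipeline is a DAG of tasks, and each task runs only after its upstream dependencies succeed. A toy pure-Python sketch of that execution model (real orchestrators add scheduling, retries, and persisted state):

```python
def run_dag(tasks: dict, deps: dict) -> list[str]:
    """Run tasks in dependency order (simple topological execution).

    tasks: task name -> callable; deps: task name -> list of upstream names.
    """
    done, order = set(), []
    while len(done) < len(tasks):
        # A task is ready once all of its upstream dependencies have finished
        ready = [t for t in tasks if t not in done
                 and all(d in done for d in deps.get(t, []))]
        if not ready:
            raise ValueError("cycle detected in DAG")
        for t in ready:
            tasks[t]()          # execute the task
            done.add(t)
            order.append(t)
    return order

# Toy usage: a classic extract -> transform -> load chain
log = []
order = run_dag(
    tasks={"extract": lambda: log.append("E"),
           "transform": lambda: log.append("T"),
           "load": lambda: log.append("L")},
    deps={"transform": ["extract"], "load": ["transform"]},
)
print(order)  # ['extract', 'transform', 'load']
```

Expressing dependencies declaratively, rather than as a hard-coded script, is what lets orchestrators retry a single failed task, run independent branches in parallel, and visualize the pipeline.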
Choosing the right data format significantly impacts performance, storage cost, and query speed.
| Format | Type | Schema | Compression | Best For |
|---|---|---|---|---|
| CSV | Row-based text | None | Poor | Simple data exchange |
| JSON | Semi-structured text | Flexible | Moderate | APIs, nested data |
| Avro | Row-based binary | Embedded | Good | Streaming, Kafka |
| Parquet | Columnar binary | Embedded | Excellent | Analytics, warehouses |
| ORC | Columnar binary | Embedded | Excellent | Hive ecosystem |
| Delta Lake | Columnar + ACID | Embedded | Excellent | Lakehouses |
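The row-based vs columnar distinction in the table drives most of these trade-offs. A toy illustration in plain Python: the same records laid out row-wise (as in CSV or Avro) versus column-wise (as in Parquet or ORC), where an analytic aggregate only needs to read a single column:

```python
# Row-based layout: one record per entry (efficient for writing whole records)
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 20.0},
    {"id": 3, "country": "US", "amount": 30.0},
]

# Columnar layout: one array per field (efficient for scanning one column)
columns = {
    "id": [1, 2, 3],
    "country": ["US", "DE", "US"],
    "amount": [10.0, 20.0, 30.0],
}

# Analytic query: total order amount.
# The row layout touches every field of every record; the columnar
# layout reads exactly one contiguous array and skips the rest.
total_row_based = sum(r["amount"] for r in rows)
total_columnar = sum(columns["amount"])
print(total_row_based, total_columnar)  # 60.0 60.0
```

Columnar files also compress better because each array holds values of one type, which is why Parquet and ORC dominate in warehouses while row formats like Avro dominate in streaming, where whole records are written one at a time.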
Here is a minimal pipeline that extracts data from an API, transforms it, and loads it into a database:
```python
import requests
import pandas as pd
from sqlalchemy import create_engine

def extract(api_url: str) -> list[dict]:
    """Extract data from a REST API."""
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()
    return response.json()

def transform(raw_data: list[dict]) -> pd.DataFrame:
    """Clean and transform raw data."""
    df = pd.DataFrame(raw_data)

    # Remove duplicates
    df = df.drop_duplicates(subset=["id"])

    # Standardize column names
    df.columns = [col.lower().replace(" ", "_") for col in df.columns]

    # Parse dates
    df["created_at"] = pd.to_datetime(df["created_at"])

    # Filter out invalid records
    df = df[df["amount"] > 0]

    return df

def load(df: pd.DataFrame, table_name: str, conn_string: str):
    """Load transformed data into a database."""
    engine = create_engine(conn_string)
    df.to_sql(table_name, engine, if_exists="append", index=False)
    print(f"Loaded {len(df)} rows into {table_name}")

# Run the pipeline
if __name__ == "__main__":
    raw = extract("https://api.example.com/orders")
    clean = transform(raw)
    load(clean, "orders", "postgresql://user:pass@localhost/analytics")
```

The same pipeline in JavaScript (Node.js):

```javascript
const axios = require("axios");
const knex = require("knex");

const db = knex({
  client: "pg",
  connection: "postgresql://user:pass@localhost/analytics",
});

async function extract(apiUrl) {
  /** Extract data from a REST API. */
  const response = await axios.get(apiUrl, { timeout: 30000 });
  return response.data;
}

function transform(rawData) {
  /** Clean and transform raw data. */
  // Remove duplicates by id
  const seen = new Set();
  const unique = rawData.filter((row) => {
    if (seen.has(row.id)) return false;
    seen.add(row.id);
    return true;
  });

  // Standardize and filter
  return unique
    .filter((row) => row.amount > 0)
    .map((row) => ({
      id: row.id,
      customer_name: row.customer_name?.toLowerCase(),
      amount: parseFloat(row.amount),
      created_at: new Date(row.created_at),
    }));
}

async function load(data, tableName) {
  /** Load transformed data into a database. */
  await db(tableName).insert(data);
  console.log(`Loaded ${data.length} rows into ${tableName}`);
}

// Run the pipeline
(async () => {
  const raw = await extract("https://api.example.com/orders");
  const clean = transform(raw);
  await load(clean, "orders");
  await db.destroy();
})();
```
Data Warehousing
Understand star and snowflake schemas, dimensional modeling, OLAP vs OLTP, data lakes, and the lakehouse paradigm.
Stream Processing
Dive into Apache Spark Streaming, Flink, windowing strategies, exactly-once semantics, and watermarks.
Data Quality
Master data validation, Great Expectations, data contracts, lineage, governance, and observability.