Indexing & Query Performance

What Is an Index?

An index is a separate data structure that the database maintains alongside your table, designed to speed up data retrieval. Think of it like the index at the back of a textbook — instead of reading every page to find a topic, you look it up in the index and jump directly to the right page.

Without an index, the database must perform a full table scan — reading every row to find the ones matching your query. For a table with 10 million rows, this is extremely slow.

With an index, the database can locate matching rows in logarithmic time, typically examining only a handful of pages instead of millions.

Without Index (Full Table Scan):

┌──────────────────────────┐
│ Row 1 → check            │
│ Row 2 → check            │
│ Row 3 → check            │
│ Row 4 → check            │
│ ...                      │
│ Row N → check            │
│                          │
│ Time: O(N)               │
│ 10M rows = 10M checks    │
└──────────────────────────┘

With Index (B-tree Lookup):

┌──────────────────────────┐
│           [M]            │
│          /   \           │
│      [D]       [R]       │
│     /   \     /   \      │
│  [A]  [G]   [N]  [W]     │
│   ↓    ↓     ↓    ↓      │
│ → Direct to matching     │
│   rows on disk           │
│ Time: O(log N)           │
│ 10M rows ≈ 3-4 lookups   │
└──────────────────────────┘

How B-tree Indexes Work

The B-tree (balanced tree) is the default and most common index type in virtually all relational databases (PostgreSQL, MySQL, SQL Server, Oracle).

Structure

A B-tree is a self-balancing tree where:

  • Each node contains multiple keys in sorted order
  • Internal nodes contain pointers to child nodes
  • Leaf nodes contain pointers to actual table rows
  • All leaf nodes are at the same depth (balanced)
             ┌──────────────────┐
             │     [35, 70]     │   ← Root node
             └──┬─────┬─────┬───┘
       _________│     │     │_________
      ↓               ↓               ↓
┌──────────┐    ┌──────────┐    ┌──────────┐
│ [10, 20] │    │ [42, 55] │    │ [80, 95] │   ← Internal nodes
└─┬──┬──┬──┘    └─┬──┬──┬──┘    └─┬──┬──┬──┘
  ↓  ↓  ↓         ↓  ↓  ↓         ↓  ↓  ↓
Leaf nodes → pointers to actual rows on disk

Looking up value 55:
1. Root: 55 > 35, 55 < 70 → go to middle child
2. Internal: 55 = 55 → found! Follow pointer to row
3. Total comparisons: 2 (instead of scanning all rows)
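The lookup above can be sketched in a few lines of Python: a sorted key array plays the role of one index level, and binary search finds a key in O(log N) comparisons (a toy illustration, not an on-disk B-tree):

```python
from bisect import bisect_left

keys = [10, 20, 35, 42, 55, 70, 80, 95]       # one sorted index level
row_pointers = {k: f"row@{k}" for k in keys}  # key -> location on disk

def index_lookup(key):
    """Binary search the sorted keys: O(log N) instead of scanning every row."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return row_pointers[key]
    return None

print(index_lookup(55))  # row@55
print(index_lookup(60))  # None (not present)
```

A real B-tree packs hundreds of keys into each node, so even a table with millions of rows stays only three or four levels deep.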

Why B-trees Excel

Operation                                        B-tree Performance   Full Scan Performance
Exact match (WHERE id = 42)                      O(log N)             O(N)
Range query (WHERE price > 50 AND price < 100)   O(log N + K)         O(N)
Sorted output (ORDER BY name)                    Already sorted       O(N log N) sort needed
Min/Max (MIN(price))                             O(log N)             O(N)

Where N is total rows and K is the number of matching rows.

Hash Indexes

Hash indexes use a hash function to map keys directly to row locations. They are extremely fast for exact-match lookups but cannot handle range queries.

Hash Function: h(key) → bucket
┌──────────────────────────────────┐
│ Key: "alice@example.com"         │
│ h("alice@example.com") → 7       │
│                                  │
│ Bucket 7 → Row pointer           │
│ Lookup: O(1) average             │
└──────────────────────────────────┘
Feature               B-tree Index              Hash Index
Exact match           O(log N)                  O(1) average
Range queries         Supported                 Not supported
Sorting               Supported                 Not supported
Prefix matching       Supported (LIKE 'abc%')   Not supported
Default in most DBs   Yes                       No (explicit in PostgreSQL)
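The trade-offs in this table can be felt with plain Python containers: a dict behaves like a hash index (O(1) exact match, no ordering), while a sorted list behaves like a B-tree's leaf level (an illustrative sketch, not a database):

```python
from bisect import bisect_left, bisect_right

prices = [12, 30, 47, 55, 68, 74, 91]            # sorted: B-tree-like
price_to_row = {p: f"row@{p}" for p in prices}   # hashed: hash-index-like

# Exact match: one bucket probe, O(1) average.
exact = price_to_row[55]

# Range query 50 < price < 100: the hash has no order, so every key is checked...
range_via_hash = sorted(p for p in price_to_row if 50 < p < 100)

# ...while the sorted structure jumps to both boundaries in O(log N + K).
lo, hi = bisect_right(prices, 50), bisect_left(prices, 100)
range_via_tree = prices[lo:hi]

print(exact, range_via_tree)  # row@55 [55, 68, 74, 91]
```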

Types of Indexes

Primary Key Index

Automatically created when you define a primary key. In InnoDB (MySQL), the primary key index is the table — data is stored in primary key order (a clustered index).

CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,  -- Index automatically created
    name VARCHAR(100)
);

Secondary Index

Any index other than the primary key. Stores a copy of the indexed column(s) plus a pointer back to the full row.

CREATE INDEX idx_users_email ON users (email);

Composite (Multi-Column) Index

Indexes multiple columns together. The column order matters significantly.

CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date);

This index is efficient for:

-- Uses the index (leftmost prefix)
WHERE customer_id = 42
WHERE customer_id = 42 AND order_date > '2024-01-01'
-- Cannot efficiently use this index
WHERE order_date > '2024-01-01' -- skips the first column
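The leftmost-prefix rule is easy to observe with SQLite's `EXPLAIN QUERY PLAN`, used here as a stand-in for your production database (exact plan wording varies by engine and version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, order_date TEXT, status TEXT)")
conn.execute("CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date)")

def plan(sql):
    # The last column of EXPLAIN QUERY PLAN output is the human-readable detail.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[-1]

# Leftmost prefix present -> index search
with_prefix = plan("SELECT * FROM orders WHERE customer_id = 42")
# Leftmost column skipped -> full table scan
without_prefix = plan("SELECT * FROM orders WHERE order_date > '2024-01-01'")

print(with_prefix)     # e.g. SEARCH orders USING INDEX idx_orders_customer_date (customer_id=?)
print(without_prefix)  # e.g. SCAN orders
```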

Covering Index

An index that contains all the columns a query needs, so the database never has to read the actual table row — it can answer the query entirely from the index.

-- Query that needs customer_id, order_date, and status
SELECT order_date, status
FROM orders
WHERE customer_id = 42;
-- Covering index for this query
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date, status);
-- The database can satisfy the entire query from the index alone
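A quick SQLite sketch makes the effect visible; its plan output says `COVERING INDEX` where PostgreSQL would report an Index Only Scan (illustrative, same schema as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, customer_id INT, order_date TEXT, status TEXT)")
conn.execute("CREATE INDEX idx_orders_covering ON orders (customer_id, order_date, status)")

detail = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT order_date, status FROM orders WHERE customer_id = 42"
).fetchone()[-1]
print(detail)  # e.g. SEARCH orders USING COVERING INDEX idx_orders_covering (customer_id=?)
```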

Partial (Filtered) Index

Indexes only a subset of rows, making the index smaller and more efficient for specific query patterns.

-- Only index active orders (most queries filter on active orders)
CREATE INDEX idx_active_orders
ON orders (customer_id, order_date)
WHERE status = 'active';
-- This index helps:
SELECT * FROM orders WHERE status = 'active' AND customer_id = 42;
-- This index does NOT help:
SELECT * FROM orders WHERE status = 'completed' AND customer_id = 42;
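SQLite also supports partial indexes, so the helps/does-not-help split above can be checked directly (plan text varies by version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, order_date TEXT, status TEXT)")
conn.execute(
    "CREATE INDEX idx_active_orders ON orders (customer_id, order_date) "
    "WHERE status = 'active'"
)

def plan(sql):
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[-1]

helped = plan("SELECT * FROM orders WHERE status = 'active' AND customer_id = 42")
not_helped = plan("SELECT * FROM orders WHERE status = 'completed' AND customer_id = 42")
print(helped)      # e.g. SEARCH orders USING INDEX idx_active_orders (customer_id=?)
print(not_helped)  # e.g. SCAN orders
```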

Unique Index

Enforces uniqueness on the indexed columns. Automatically created for UNIQUE and PRIMARY KEY constraints.

CREATE UNIQUE INDEX idx_users_email_unique ON users (email);
-- Attempting to insert a duplicate email will raise an error
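A minimal sketch of the constraint in action, using SQLite (the same DDL works there; the driver raises `IntegrityError` on the duplicate):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_users_email_unique ON users (email)")

conn.execute("INSERT INTO users (email) VALUES ('alice@example.com')")
try:
    # Second insert with the same email violates the unique index.
    conn.execute("INSERT INTO users (email) VALUES ('alice@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```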

Index Types Summary

Index Type    Use Case                                        Key Benefit
Primary Key   Row identity                                    Automatic, clustered (InnoDB)
Secondary     Speed up queries on non-PK columns              Faster lookups
Composite     Queries filtering/sorting on multiple columns   Multi-column optimization
Covering      Queries reading only indexed columns            Avoids table row lookups
Partial       Queries targeting a subset of rows              Smaller, faster index
Unique        Enforcing uniqueness                            Constraint + performance

When to Create Indexes

Create indexes on:

  • Columns used frequently in WHERE clauses
  • Columns used in JOIN conditions (foreign keys)
  • Columns used in ORDER BY and GROUP BY
  • Columns with high selectivity (many distinct values)

Avoid excessive indexes because:

  • Each index slows down INSERT, UPDATE, and DELETE operations
  • Indexes consume additional disk space
  • Each additional index gives the query planner more candidate plans to evaluate, increasing planning time and the risk of a suboptimal choice

Index Selectivity

Selectivity = number of distinct values / total number of rows. Higher selectivity means the index is more useful.

-- High selectivity (good for indexing)
SELECT COUNT(DISTINCT email) / COUNT(*)::float AS selectivity
FROM users;
-- Result: 1.0 (every email is unique → perfect selectivity)
-- Low selectivity (poor for indexing)
SELECT COUNT(DISTINCT status) / COUNT(*)::float AS selectivity
FROM orders;
-- Result: 0.0003 (only 3 distinct statuses across 10,000 rows)
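The same calculation in plain Python, on synthetic data mirroring the numbers above:

```python
def selectivity(values):
    """Distinct values divided by total rows; 1.0 means every value is unique."""
    return len(set(values)) / len(values)

emails = [f"user{i}@example.com" for i in range(10_000)]                  # all unique
statuses = ["active", "completed", "cancelled"] * 3_000 + ["active"] * 1_000  # 3 distinct

print(selectivity(emails))    # 1.0    -> great index candidate
print(selectivity(statuses))  # 0.0003 -> poor index candidate
```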

Reading EXPLAIN Plans

The EXPLAIN command reveals how the database plans to execute your query. It is the single most important tool for diagnosing query performance.

PostgreSQL EXPLAIN

EXPLAIN ANALYZE
SELECT c.name, COUNT(o.order_id) AS total_orders
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.city = 'New York'
GROUP BY c.customer_id, c.name;
Hash Left Join  (cost=1.05..2.18 rows=2 width=40) (actual time=0.045..0.052 rows=2 loops=1)
  Hash Cond: (o.customer_id = c.customer_id)
  ->  Seq Scan on orders o  (cost=0.00..1.03 rows=3 width=8) (actual time=0.003..0.004 rows=3 loops=1)
  ->  Hash  (cost=1.04..1.04 rows=1 width=36) (actual time=0.020..0.021 rows=2 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        ->  Seq Scan on customers c  (cost=0.00..1.04 rows=1 width=36) (actual time=0.010..0.013 rows=2 loops=1)
              Filter: ((city)::text = 'New York'::text)
              Rows Removed by Filter: 2

Key things to look for:

Term                Meaning                                            Good or Bad?
Seq Scan            Full table scan (reads every row)                  Bad on large tables
Index Scan          Uses an index to find rows                         Good
Index Only Scan     Answers query entirely from index (covering)       Best
Bitmap Index Scan   Uses index to build a bitmap, then reads table     Good for many matches
Hash Join           Builds hash table from one side, probes from other Good for equality joins
Nested Loop         For each row in outer, scan inner                  Good for small datasets
Sort                Sorting operation (check if index could avoid this) Depends on size
actual time         Real execution time in milliseconds                Lower is better
rows                Estimated vs actual row count                      Should be close

Before and After: Adding an Index

Scenario: Finding all orders for a specific customer on a table with 1,000,000 rows.

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;
-- Result:
-- Seq Scan on orders (cost=0.00..18334.00 rows=125 width=44)
-- (actual time=0.215..98.453 rows=118 loops=1)
-- Filter: (customer_id = 42)
-- Rows Removed by Filter: 999882
-- Planning Time: 0.089 ms
-- Execution Time: 98.512 ms

Problem: Sequential scan reads all 1,000,000 rows, removing 999,882. Takes ~98ms.
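The "after" half is the index itself. A SQLite sketch (with an assumed index name `idx_orders_customer`) shows the plan flipping from a scan to an index search; the PostgreSQL equivalent would switch from a Seq Scan to an Index or Bitmap Index Scan, with exact timings depending on your data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INT)")

def plan():
    return conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
    ).fetchone()[-1]

before = plan()  # e.g. SCAN orders
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan()   # e.g. SEARCH orders USING INDEX idx_orders_customer (customer_id=?)
print(before, "->", after)
```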

Query Optimization Techniques

1. Avoid SELECT *

-- Bad: fetches all columns, even those you do not need
SELECT * FROM orders WHERE customer_id = 42;
-- Good: fetch only what you need (enables covering index)
SELECT order_id, order_date, status
FROM orders WHERE customer_id = 42;

2. Use WHERE on Indexed Columns

-- Bad: function on indexed column prevents index use
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- Good: use a functional index or store normalized data
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
-- Or store email in lowercase and query directly
SELECT * FROM users WHERE email = 'alice@example.com';

3. Optimize JOINs

-- Ensure JOIN columns are indexed
CREATE INDEX idx_order_items_order ON order_items (order_id);
CREATE INDEX idx_order_items_product ON order_items (product_id);
-- Join on indexed columns
SELECT o.order_id, p.name, oi.quantity
FROM orders o
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id
WHERE o.customer_id = 42;

4. Use LIMIT for Pagination

-- Bad: offset-based pagination degrades for large offsets
SELECT * FROM products ORDER BY product_id LIMIT 20 OFFSET 100000;
-- Database must scan and discard 100,000 rows
-- Good: keyset (cursor) pagination
SELECT * FROM products
WHERE product_id > 100000 -- last seen ID from previous page
ORDER BY product_id
LIMIT 20;
-- Database jumps directly to the right position via index
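A runnable sketch of keyset pagination against SQLite with synthetic data (table and helper names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(i, f"product-{i}") for i in range(1, 101)],
)

def page_after(last_seen_id, page_size=20):
    """Fetch the next page using the last ID from the previous page as the cursor."""
    return conn.execute(
        "SELECT product_id FROM products WHERE product_id > ? "
        "ORDER BY product_id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()

first_page = page_after(0)                   # ids 1..20
next_page = page_after(first_page[-1][0])    # ids 21..40
print(next_page[0][0], next_page[-1][0])     # 21 40
```

Each page starts from an indexed position, so page 5,000 costs the same as page 1.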

5. Avoid OR on Different Columns

-- Bad: often prevents index use
SELECT * FROM users WHERE email = 'alice@example.com' OR city = 'New York';
-- Better: use UNION to allow each branch to use its own index
SELECT * FROM users WHERE email = 'alice@example.com'
UNION
SELECT * FROM users WHERE city = 'New York';

The N+1 Query Problem

The N+1 problem is one of the most common performance issues in applications using ORMs. It occurs when your code executes 1 query to fetch a list, then N additional queries to fetch related data for each item.

N+1 Problem Example:
Query 1: SELECT * FROM authors; -- Returns 100 authors
Query 2: SELECT * FROM books WHERE author_id = 1; -- For author 1
Query 3: SELECT * FROM books WHERE author_id = 2; -- For author 2
...
Query 101: SELECT * FROM books WHERE author_id = 100; -- For author 100
Total: 101 queries for what should be 1 or 2 queries!
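The query counts can be verified with SQLite's trace callback, which records every statement the connection executes (synthetic data: 100 authors, one book each):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE books (book_id INTEGER PRIMARY KEY, author_id INT, title TEXT)")
conn.executemany("INSERT INTO authors VALUES (?, ?)", [(i, f"author-{i}") for i in range(1, 101)])
conn.executemany("INSERT INTO books VALUES (?, ?, ?)", [(i, i, f"book-{i}") for i in range(1, 101)])

queries = []
conn.set_trace_callback(queries.append)  # record every SQL statement executed

# N+1 pattern: one query for authors, then one per author for their books.
authors = conn.execute("SELECT author_id, name FROM authors").fetchall()
for author_id, _ in authors:
    conn.execute("SELECT title FROM books WHERE author_id = ?", (author_id,)).fetchall()
n_plus_one = len(queries)

queries.clear()
# JOIN pattern: everything in a single round trip.
conn.execute(
    "SELECT a.name, b.title FROM authors a "
    "LEFT JOIN books b ON a.author_id = b.author_id"
).fetchall()
joined = len(queries)

print(n_plus_one, "vs", joined)  # 101 vs 1
```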

Solutions:

-- One query with a JOIN — no N+1 problem
SELECT a.name, b.title, b.published
FROM authors a
LEFT JOIN books b ON a.author_id = b.author_id
ORDER BY a.name;

Connection Pooling

Opening a database connection is expensive (TCP handshake, authentication, memory allocation). Connection pooling maintains a pool of reusable connections, dramatically reducing overhead.

Without Pooling:

┌──────────┐
│ Request 1│─── open ──── close
│ Request 2│─── open ──── close
│ Request 3│─── open ──── close
│ Request 4│─── open ──── close
│ Request 5│─── open ──── close
└──────────┘
5 connections opened/closed
~50ms overhead each

With Pooling:

┌──────────┐
│ Request 1│──┐
│ Request 2│──┤   ┌──────────────┐
│ Request 3│──┼───│ Connection   │──── Database
│ Request 4│──┤   │ Pool (5-20   │
│ Request 5│──┘   │ connections) │
└──────────┘      └──────────────┘
Connections reused across requests
~0ms overhead after initial setup

Common connection pool settings:

Setting              Description                                    Typical Value
min_connections      Minimum idle connections kept open             2-5
max_connections      Maximum total connections allowed              10-20 per app instance
idle_timeout         Close idle connections after this duration     300 seconds
connection_timeout   Max time to wait for an available connection   30 seconds

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://user:pass@localhost/mydb",
    pool_size=10,       # max connections in pool
    max_overflow=5,     # extra connections beyond pool_size
    pool_timeout=30,    # wait time for available connection
    pool_recycle=1800,  # recycle connections after 30 minutes
)

Performance Optimization Checklist

Use this checklist when diagnosing slow queries:

  1. Run EXPLAIN ANALYZE on the slow query
  2. Look for Seq Scans on large tables — add indexes for filtered columns
  3. Check estimated vs actual rows — large discrepancies mean stale statistics (ANALYZE the table)
  4. Verify indexes are being used — functions on columns, implicit type casts, and OR conditions can prevent index use
  5. Check for N+1 queries — enable query logging and look for repeated similar queries
  6. Consider covering indexes — eliminate table lookups entirely
  7. Review connection pooling — ensure connections are reused, not opened per request
  8. Monitor slow query logs — PostgreSQL: log_min_duration_statement, MySQL: slow_query_log

Next Steps