
Apache Kafka

Apache Kafka is a distributed event streaming platform used by thousands of companies for high-throughput, fault-tolerant, real-time data pipelines and event-driven architectures. Originally built at LinkedIn, Kafka handles trillions of events per day at companies like Netflix, Uber, and Airbnb.

Unlike traditional message queues that delete messages after consumption, Kafka retains messages in a durable, ordered log. This enables multiple consumers to read the same data, replay events, and build real-time and batch processing pipelines.


Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                          Kafka Cluster                           │
│                                                                  │
│   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐        │
│   │  Broker 0   │     │  Broker 1   │     │  Broker 2   │        │
│   │             │     │             │     │             │        │
│   │  Topic A    │     │  Topic A    │     │  Topic A    │        │
│   │ Partition 0 │     │ Partition 1 │     │ Partition 2 │        │
│   │  (Leader)   │     │  (Leader)   │     │  (Leader)   │        │
│   │             │     │             │     │             │        │
│   │  Topic A    │     │  Topic A    │     │  Topic A    │        │
│   │ Partition 1 │     │ Partition 0 │     │ Partition 0 │        │
│   │  (Replica)  │     │  (Replica)  │     │  (Replica)  │        │
│   └─────────────┘     └─────────────┘     └─────────────┘        │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                     ZooKeeper / KRaft                      │  │
│  │            (cluster metadata and coordination)             │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
        ▲                                        │
        │                                        ▼
   ┌─────────┐                          ┌──────────────┐
   │Producers│                          │  Consumers   │
   │         │                          │  (Consumer   │
   │  App A  │                          │   Groups)    │
   │  App B  │                          │    App X     │
   └─────────┘                          │    App Y     │
                                        └──────────────┘

Core Components

Component        Description
---------        -----------
Broker           A single Kafka server that stores data and serves clients
Cluster          A group of brokers working together
Topic            A named feed of messages (like a database table)
Partition        An ordered, immutable sequence of messages within a topic
Producer         An application that publishes messages to topics
Consumer         An application that reads messages from topics
Consumer Group   A set of consumers that cooperatively consume from a topic
ZooKeeper/KRaft  Manages cluster metadata, broker coordination, and leader election

Topics and Partitions

Topics

A topic is a logical channel for messages. Topics are multi-subscriber: multiple consumer groups can read from the same topic independently.

Partitions

Each topic is divided into partitions. Partitions are the unit of parallelism and are distributed across brokers.

Topic "orders" with 3 partitions:

    Partition 0: [msg0] [msg3] [msg6] [msg9]  ──▶ offset
    Partition 1: [msg1] [msg4] [msg7]         ──▶ offset
    Partition 2: [msg2] [msg5] [msg8] [msg10] ──▶ offset

Messages are appended to each partition in order.
Each message has a unique offset within its partition.
Messages are NOT ordered across partitions.

Offset: a sequential ID for each message within a partition.
    Partition 0: offsets 0, 1, 2, 3, ...
    Partition 1: offsets 0, 1, 2, ...

Partition Assignment

How does Kafka decide which partition a message goes to?

1. Explicit partition: Producer specifies partition number
2. Key-based: partition = hash(key) % num_partitions
Same key → same partition → guaranteed ordering per key
3. Round-robin: No key → distribute evenly across partitions
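The key-based strategy can be sketched in a few lines of Python. This is a simplified stand-in: real Kafka clients hash keys with murmur2, while the sketch below uses an MD5 digest just to keep the mapping stable across processes; `partition_for` is a made-up helper, not a client API.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition. Simplified stand-in:
    real Kafka clients use murmur2, not MD5."""
    # A stable digest keeps the mapping deterministic across processes
    digest = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return digest % num_partitions

# Same key -> same partition -> per-key ordering is preserved
assert partition_for("CUST-42", 3) == partition_for("CUST-42", 3)
```

Note that the mapping changes if `num_partitions` changes, which is one reason repartitioning an existing keyed topic breaks per-key ordering guarantees.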

Replication

Each partition is replicated across multiple brokers for fault tolerance:

Topic "orders", Partition 0, Replication Factor = 3:

    Broker 0: Partition 0 (LEADER)   ◀── All reads and writes
    Broker 1: Partition 0 (FOLLOWER) ◀── Replicates from the leader
    Broker 2: Partition 0 (FOLLOWER) ◀── Replicates from the leader

If Broker 0 fails:
    Broker 1 or Broker 2 is elected the new leader.
    No data is lost (if the in-sync replicas are caught up).

ISR (In-Sync Replicas): the followers that are caught up
with the leader. Only ISR members can be elected leader.
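The election rule can be modeled as a toy in Python, assuming clean leader election (the default, `unclean.leader.election.enable=false`): only an ISR member may become leader, and replicas earlier in the replica list are preferred. `elect_leader` is a hypothetical helper for illustration, not a Kafka API.

```python
def elect_leader(replicas, isr, failed_broker):
    """Pick a new partition leader after the current one fails.
    Toy model: only in-sync replicas are eligible (clean election)."""
    candidates = {b for b in isr if b != failed_broker}
    if not candidates:
        raise RuntimeError("no in-sync replica left; partition is offline")
    # Prefer replicas in replica-list order
    return next(b for b in replicas if b in candidates)

# Partition 0 lives on brokers [0, 1, 2]; all in sync when broker 0 dies
print(elect_leader([0, 1, 2], isr={0, 1, 2}, failed_broker=0))  # 1
# If broker 1 had fallen out of the ISR, broker 2 would win instead
print(elect_leader([0, 1, 2], isr={0, 2}, failed_broker=0))     # 2
```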

Producers

Producers publish messages to Kafka topics. Key configuration options affect durability, throughput, and latency.

Producer Acknowledgments (acks)

acks=0: "Fire and forget"
    Producer ──▶ Broker
    (No acknowledgment. Fastest, but data loss is possible.)

acks=1: "Leader acknowledged"
    Producer ──▶ Broker (Leader)
    Producer ◀── ACK (after the leader writes to its local log)
    (If the leader crashes before replication, data is lost.)

acks=all (-1): "All in-sync replicas acknowledged"
    Producer ──▶ Broker (Leader) ──▶ Follower 1 ──▶ Follower 2
    Producer ◀── ACK (after all ISR replicas have written)
    (Strongest durability. Highest latency.)
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode(),
    key_serializer=lambda k: k.encode() if k else None,
    acks='all',               # Wait for all in-sync replicas
    retries=3,                # Retry on transient failures
    batch_size=16384,         # Batch messages (16 KB)
    linger_ms=10,             # Wait up to 10 ms to fill a batch
    compression_type='gzip',
    enable_idempotence=True   # Exactly-once producer
)

def publish_order(order):
    # Key determines partition: same key -> same partition
    future = producer.send(
        'orders',
        key=order['customer_id'],
        value=order,
        headers=[
            ('source', b'order-service'),
            ('version', b'1.0')
        ]
    )
    # Block until the message is sent (or use a callback)
    metadata = future.get(timeout=10)
    print(
        f"Sent to partition {metadata.partition} "
        f"at offset {metadata.offset}"
    )

# Publish an order
publish_order({
    'order_id': 'ORD-001',
    'customer_id': 'CUST-42',
    'items': ['widget', 'gadget'],
    'total': 149.99
})

# Flush any buffered messages
producer.flush()
producer.close()

Consumers and Consumer Groups

Consumer Groups

A consumer group is a set of consumers that cooperatively read from a topic. Each partition is assigned to exactly one consumer within the group.

Topic "orders" with 4 partitions:

Consumer Group "order-processing":
    Partition 0 ──▶ Consumer A
    Partition 1 ──▶ Consumer A
    Partition 2 ──▶ Consumer B
    Partition 3 ──▶ Consumer C

Three consumers share four partitions, so Consumer A handles two.

If Consumer B crashes:
    Partition 2 is reassigned (rebalanced) to A or C.

Adding Consumer D:
    Partitions are rebalanced across A, B, C, D (one partition per consumer).

Adding Consumer E (5 consumers > 4 partitions):
    Consumer E sits idle (no partition to consume from).
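The assignment logic above can be sketched as a toy version of the range assignor; the hypothetical `range_assign` below illustrates the arithmetic only, not the client's actual assignor or rebalance protocol.

```python
def range_assign(partitions, consumers):
    """Toy range assignment: sort both lists, give each consumer a
    contiguous block of partitions; the remainder goes to the first
    consumers. Consumers beyond the partition count get nothing."""
    consumers = sorted(consumers)
    partitions = sorted(partitions)
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, c in enumerate(consumers):
        count = per + (1 if i < extra else 0)
        assignment[c] = partitions[start:start + count]
        start += count
    return assignment

print(range_assign([0, 1, 2, 3], ["A", "B", "C"]))
# {'A': [0, 1], 'B': [2], 'C': [3]}
print(range_assign([0, 1, 2, 3], ["A", "B", "C", "D", "E"]))
# {'A': [0], 'B': [1], 'C': [2], 'D': [3], 'E': []}  -- E sits idle
```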

Offset Management

Each consumer tracks its position in each partition using an offset:

Partition 0: [0] [1] [2] [3] [4] [5] [6] [7] [8]
                          ▲                   ▲
                          │                   │
                 committed offset      current position
                 (last processed)      (reading here)

If the consumer crashes and restarts, it resumes from the
committed offset (3), so messages 4-8 are reprocessed.

With auto-commit:
    Offsets are committed periodically (default: every 5 s).
    Messages between the last commit and the crash are reprocessed.

With manual commit:
    You control exactly when offsets are committed.
    More control, but more responsibility.
from kafka import KafkaConsumer
import json

def process_order(order):
    print(f"Order {order['order_id']} processed")

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers=['localhost:9092'],
    group_id='order-processing',
    auto_offset_reset='earliest',
    enable_auto_commit=False,
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

try:
    for message in consumer:
        print(
            f"Partition: {message.partition}, "
            f"Offset: {message.offset}, "
            f"Key: {message.key}"
        )
        order = message.value
        print(f"Processing order: {order['order_id']}")
        process_order(order)
        # Manual commit only after successful processing
        consumer.commit()
except KeyboardInterrupt:
    pass
finally:
    consumer.close()

Kafka Streams

Kafka Streams is a client library for building stream processing applications that transform, aggregate, and join data from Kafka topics.

Input Topic ──▶ Kafka Streams Application ──▶ Output Topic

Stream processing operations:
- filter: remove unwanted messages
- map: transform messages
- flatMap: one-to-many transformation
- groupBy: group messages by key
- aggregate/count/reduce: stateful operations
- join: combine streams by key within a time window
- windowed operations: process data in time windows
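Windowed aggregation reduces to simple bucketing arithmetic: a tumbling window is identified by its start, the event timestamp rounded down to the nearest window boundary. The Python sketch below models that logic only (it is not Kafka Streams; `window_start` and `aggregate` are made-up names):

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling windows

def window_start(timestamp_ms: int) -> int:
    """Round a timestamp down to its tumbling-window start."""
    return (timestamp_ms // WINDOW_MS) * WINDOW_MS

def aggregate(events):
    """Sum amounts per (key, window), mirroring
    groupBy + windowedBy + aggregate."""
    totals = defaultdict(float)
    for key, amount, ts in events:
        totals[(key, window_start(ts))] += amount
    return dict(totals)

events = [
    ("electronics", 299.0, 0),
    ("electronics", 599.0, 240_000),  # same 5-minute window
    ("books", 29.0, 360_000),         # falls in the next window
]
print(aggregate(events))
# {('electronics', 0): 898.0, ('books', 300000): 29.0}
```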

Stream Processing Example

Use case: real-time order analytics

Input ("orders" topic):
    { orderId: "ORD-1", category: "electronics", amount: 299 }
    { orderId: "ORD-2", category: "books",       amount: 29 }
    { orderId: "ORD-3", category: "electronics", amount: 599 }

Processing:
    1. Filter: keep only orders > $50
    2. Group by category
    3. Aggregate: sum amounts per category in 5-minute windows

Output ("order-analytics" topic):
    { category: "electronics", window: "14:00-14:05", total: 898 }
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.common.serialization.*;
import java.time.Duration;
import java.util.Properties;

public class OrderAnalytics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG,
            "order-analytics");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG,
            "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
            Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
            Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read from the "orders" topic
        KStream<String, String> orders = builder.stream("orders");

        // Process: filter, group, window, aggregate
        KTable<Windowed<String>, Double> analytics = orders
            // Deserialize JSON
            .mapValues(OrderAnalytics::parseOrder)
            // Keep only orders > $50
            .filter((key, order) -> order.amount() > 50)
            // Group by category
            .groupBy((key, order) -> order.category())
            // 5-minute tumbling window
            .windowedBy(TimeWindows.ofSizeWithNoGrace(
                Duration.ofMinutes(5)))
            // Sum amounts per category per window
            .aggregate(
                () -> 0.0,
                (key, order, total) -> total + order.amount(),
                Materialized.with(Serdes.String(), Serdes.Double())
            );

        // Write results to the output topic
        analytics.toStream()
            .map((windowedKey, total) -> KeyValue.pair(
                windowedKey.key(),
                String.format(
                    "{\"category\":\"%s\",\"total\":%.2f}",
                    windowedKey.key(), total)
            ))
            .to("order-analytics");

        // Start the stream processing topology
        KafkaStreams streams =
            new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(
            new Thread(streams::close));
    }

    record Order(String orderId, String category, double amount) {}

    static Order parseOrder(String json) {
        // Parse JSON into an Order (simplified placeholder)
        return new Order("ORD-1", "electronics", 299.0);
    }
}

Exactly-Once Semantics

Kafka provides exactly-once semantics (EOS) through three mechanisms:

1. Idempotent producer:
   The producer assigns sequence numbers to messages.
   The broker deduplicates by (producerId, sequence), so
   duplicate sends (from retries) are silently dropped.

2. Transactional producer:
   Atomic writes across multiple partitions: either ALL
   messages in a transaction are committed, or NONE are.

3. Consumer read_committed:
   The consumer reads only messages from committed transactions.
   Uncommitted and aborted messages are invisible.
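Mechanism 1 can be modeled in a few lines. The `Broker` class below is a toy stand-in for the broker's per-producer sequence tracking, not Kafka's actual implementation:

```python
class Broker:
    """Toy model of broker-side idempotent-producer deduplication:
    each (producer_id, sequence) pair is accepted at most once."""
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence accepted

    def append(self, producer_id, sequence, message):
        if sequence <= self.last_seq.get(producer_id, -1):
            return "duplicate-dropped"  # retry of an already-written message
        self.last_seq[producer_id] = sequence
        self.log.append(message)
        return "appended"

b = Broker()
print(b.append("p1", 0, "order-1"))  # appended
print(b.append("p1", 0, "order-1"))  # duplicate-dropped (producer retried)
print(b.append("p1", 1, "order-2"))  # appended
print(b.log)                         # ['order-1', 'order-2']
```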

End-to-End Exactly-Once

Exactly-once from consume to produce:

    Consumer reads from input topic
    ├── Process message
    ├── Produce to output topic  ┐
    └── Commit consumer offset   ┘ ATOMIC (transactional)

If ANY step fails, the entire transaction is aborted.
On retry, the consumer rereads and reprocesses.
But the output is not duplicated (the transaction was aborted).
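The atomic consume-transform-produce loop can be sketched as a toy in-memory transaction. This models the guarantee only, not Kafka's transaction protocol; `process_exactly_once` is a hypothetical helper:

```python
def process_exactly_once(input_log, offset, output_log, transform):
    """Toy consume-transform-produce transaction: the output write and
    the offset commit become visible together or not at all, so a retry
    after a failure never duplicates output."""
    staged_output, staged_offset = [], offset
    try:
        while staged_offset < len(input_log):
            staged_output.append(transform(input_log[staged_offset]))
            staged_offset += 1
    except Exception:
        return offset  # abort: nothing staged becomes visible
    output_log.extend(staged_output)  # "commit": output + offset together
    return staged_offset

out = []
committed = process_exactly_once(["a", "b"], 0, out, str.upper)
print(out, committed)  # ['A', 'B'] 2
```

If `transform` raises midway, the function returns the original offset and writes nothing, so rerunning it reprocesses the same input without duplicating output.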

Kafka Use Cases

Use Case                    Description                                              Example Companies
--------                    -----------                                              -----------------
Event streaming             Real-time event pipelines between systems                Netflix, LinkedIn
Log aggregation             Collect logs from all services into a central platform   All large tech companies
Metrics collection          Real-time operational metrics                            Twitter, Datadog
Stream processing           Real-time data transformation and analytics              Uber (surge pricing)
Event sourcing              Store all state changes as events                        Financial systems
CDC (Change Data Capture)   Replicate database changes to other systems              Debezium + Kafka
Microservice communication  Async communication between services                     Most microservice architectures

Kafka Operations Essentials

# Create a topic
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --partitions 6 \
  --replication-factor 3

# List topics
kafka-topics.sh --list \
  --bootstrap-server localhost:9092

# Describe a topic (partitions, leaders, ISR)
kafka-topics.sh --describe \
  --bootstrap-server localhost:9092 \
  --topic orders

# Produce messages from the CLI
kafka-console-producer.sh \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --property key.separator=: \
  --property parse.key=true

# Consume messages from the CLI
kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --from-beginning \
  --group test-consumer

# Check consumer group lag
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group order-processing

Summary

Concept              Key Takeaway
-------              ------------
Topics & Partitions  Topics are split into ordered partitions for parallelism
Replication          Each partition is replicated across brokers for fault tolerance
Consumer Groups      Consumers in a group share partitions for parallel processing
Offsets              Track consumer position; enable replay and exactly-once
acks=all             Strongest durability guarantee for producers
Kafka Streams        Client library for stateful stream processing
Exactly-Once         Achieved through idempotent producers and transactions
Retention            Messages persist for a configurable period (not deleted on consumption)