Understanding Kafka Topics

1. Introduction

In this tutorial, you will learn about Kafka topics and their role in event-driven architectures. Kafka topics serve as storage for published messages and allow multiple microservices to consume events efficiently. Kafka is widely used in distributed systems for real-time data streaming and event-driven applications.

2. What Is a Kafka Topic?

A Kafka topic is a logical storage unit where Kafka stores all published messages. Topics enable microservices to communicate asynchronously by publishing and consuming events. They help decouple producers and consumers, allowing systems to scale independently.

Example Scenario

Consider a Products Microservice that publishes an event whenever a new product is created. Several microservices need to react to that event:

SMS Notification Microservice: This microservice listens to events from the Kafka topic and sends SMS notifications to users whenever a new product is added.
Email Notification Microservice: This microservice subscribes to the topic and generates email notifications for registered users.
Push Notification Microservice: This microservice receives events from Kafka and sends push notifications to mobile applications to inform users about new products.

Instead of sending events directly to these microservices, the Products Microservice publishes the event to a Kafka topic. Each interested microservice then consumes the event from the topic at its own pace, ensuring reliable event processing.
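The fan-out described above can be sketched with a small in-memory model. This is only an illustration, not a Kafka client: the InMemoryTopic class, its methods, and the subscriber names are hypothetical, and the topic is treated as a single partition for simplicity.

```python
from collections import defaultdict


class InMemoryTopic:
    """Toy stand-in for a single-partition topic (not a real Kafka client)."""

    def __init__(self, name):
        self.name = name
        self.events = []                    # append-only list of events
        self.positions = defaultdict(int)   # next index to read, per subscriber

    def publish(self, event):
        self.events.append(event)

    def poll(self, subscriber):
        """Return the events this subscriber has not seen yet and advance it."""
        start = self.positions[subscriber]
        batch = self.events[start:]
        self.positions[subscriber] = len(self.events)
        return batch


topic = InMemoryTopic("product-created-event-topic")
topic.publish({"productId": 1, "title": "Laptop"})
topic.publish({"productId": 2, "title": "Phone"})

sms_batch = topic.poll("sms-notification")      # reads both events
email_batch = topic.poll("email-notification")  # reads the same two events
topic.publish({"productId": 3, "title": "Tablet"})
sms_next = topic.poll("sms-notification")       # only the new event
push_batch = topic.poll("push-notification")    # all three; it had not polled yet
```

The producer never calls the subscribers directly, and a slow subscriber simply catches up on its next poll because reading an event does not remove it from the topic.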

3. Kafka Topics and Partitions

Kafka topics are divided into partitions, and each partition can be replicated across multiple Kafka servers (brokers) for fault tolerance, as set by the topic's replication factor. With a replication factor greater than 1, the event data remains accessible from the other brokers holding a replica if one broker fails, which provides high availability and resilience.

Benefits of Partitioning

Parallel Processing: Multiple instances of the same microservice can read from different partitions, improving scalability and performance by distributing the load among multiple consumers.
Load Distribution: Events are distributed across partitions, allowing better resource utilization and ensuring that no single consumer is overwhelmed with too many events.
Fault Tolerance: Replicating partitions across multiple Kafka brokers ensures that data remains available even in case of hardware failures.
Efficient Scaling: By increasing the number of partitions, the system can handle higher loads and distribute processing tasks across multiple consumers efficiently.

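For example, if a topic has three partitions, up to three instances of the same microservice (members of one consumer group) can each read from a separate partition at the same time, increasing throughput. Different microservices, each in its own consumer group, all read every partition independently of one another.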
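To illustrate how partitions are shared among instances of one microservice, here is a minimal sketch of a round-robin assignment. The function and the instance names are hypothetical, and Kafka's built-in assignment strategies are more elaborate than this; the sketch only shows the basic idea that each partition is owned by exactly one instance in the group.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2]

# Two instances: one of them has to handle two partitions.
two_instances = assign_partitions(partitions, ["email-1", "email-2"])
print(two_instances)   # {'email-1': [0, 2], 'email-2': [1]}

# Four instances: only three partitions exist, so one instance stays idle.
four_instances = assign_partitions(
    partitions, ["email-1", "email-2", "email-3", "email-4"]
)
print(four_instances)  # {'email-1': [0], 'email-2': [1], 'email-3': [2], 'email-4': []}
```

The second case shows why the partition count caps the useful parallelism of a single microservice: instances beyond the number of partitions receive nothing to read.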

4. Naming and Managing Topics

Every topic has a unique name (e.g., product-created-event-topic). The topic name should be meaningful and describe the type of events it stores.
A topic’s partition count is set at creation. It can be increased later but cannot be decreased: Kafka does not support removing partitions from a topic, since that would mean discarding or relocating the data they hold.

Each partition is physically stored on a Kafka broker’s hard disk as a sequence of messages, ensuring durability and persistence.

Kafka provides various configuration options for topics, such as retention policies, replication factors, and compression settings, which can be adjusted based on system requirements.

5. Kafka Topic Data Structure

Each partition consists of ordered events stored in sequence, similar to rows in a database table. Kafka topics use a log-based storage mechanism where messages are appended to the end of the log and read sequentially by consumers.

Understanding Offsets

Each event in a partition has a unique offset, starting from 0. The offset acts as an identifier and helps consumers track their position in the partition. New events are always appended at the end, and offsets are immutable:

Offset 0 → First event in the partition.
Offset 1 → Second event in the partition.
Offset N → Latest event stored in the partition.

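Unlike traditional databases, Kafka does not let clients update an event in place or delete an individual event once it is stored. Events leave the log only through the topic’s retention or compaction policy. This ensures a clear and reliable event history.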
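The relationship between appends and offsets can be sketched with a toy append-only log. The PartitionLog class is hypothetical and keeps everything in memory; it only demonstrates that offsets start at 0, grow by one per append, and let a consumer resume from a known position.

```python
class PartitionLog:
    """Toy append-only log for one partition; offsets are list indexes."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Store a record at the end of the log and return its offset."""
        self._records.append(value)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return (offset, value) pairs starting at the given offset."""
        return list(enumerate(self._records[offset:], start=offset))


log = PartitionLog()
first = log.append("product-1 created")    # offset 0
second = log.append("product-2 created")   # offset 1
third = log.append("product-3 created")    # offset 2

# A consumer that already processed offset 0 resumes from offset 1.
resumed = log.read_from(1)
print(resumed)  # [(1, 'product-2 created'), (2, 'product-3 created')]
```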

6. Message Retention in Kafka

Messages persist in Kafka topics even after being consumed, allowing multiple consumers to process the same data independently.
By default, messages are retained for 7 days. The broker-wide default is set with properties such as log.retention.hours or log.retention.bytes, and individual topics can override it with the topic-level properties retention.ms or retention.bytes.
Retention policies can be customized based on business needs, allowing messages to be stored indefinitely or deleted after a specified period.
Kafka also supports compacted topics, where at least the latest value for each key is retained and older values for that key are eventually removed. This is useful for maintaining stateful information like user profiles.
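The difference between time-based retention and compaction can be sketched on a small list of records. Both functions and the record layout are hypothetical simplifications: a real broker applies retention to whole log segments and runs compaction in the background, so old records do not disappear at the exact moment shown here.

```python
def apply_time_retention(records, now_ms, retention_ms):
    """Keep only records whose age is within the retention window."""
    return [r for r in records if now_ms - r["timestamp"] <= retention_ms]


def compact(records):
    """Keep the most recent record per key, preserving the log order."""
    latest_index = {r["key"]: i for i, r in enumerate(records)}
    return [r for i, r in enumerate(records) if latest_index[r["key"]] == i]


DAY_MS = 24 * 60 * 60 * 1000
records = [
    {"key": "user1", "value": "name=Ann", "timestamp": 0},
    {"key": "user2", "value": "name=Bob", "timestamp": 1 * DAY_MS},
    {"key": "user1", "value": "name=Anna", "timestamp": 6 * DAY_MS},
]

# Ten days in, with a 7-day window: only the record from day 6 survives.
after_retention = apply_time_retention(records, now_ms=10 * DAY_MS, retention_ms=7 * DAY_MS)

# Compaction ignores age: the latest value of every key survives.
after_compaction = compact(records)
```

Time-based retention drops user2 entirely once its only record is too old, whereas compaction keeps one record per key regardless of age, which is why compaction suits data that represents current state.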

7. If I use a key, will messages still go to the same partition after adding more partitions?

Not necessarily. If you add more partitions to a Kafka topic after you have started producing keyed messages, messages with a given key may no longer be written to the same partition as before.

Here’s a detailed explanation:

7.1. How Kafka Determines Partition from Key

When a message is sent with a key, Kafka’s default partitioner uses the following logic to determine the partition:

partition = hash(key) % number_of_partitions

That is:

Kafka takes the hash of the serialized key bytes (the default partitioner uses a murmur2 hash).
Then it calculates the modulo with the current number of partitions in the topic.
This determines the partition number.
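The three steps above can be sketched in a few lines. Note the assumption: zlib.crc32 is used here only as a convenient, deterministic stand-in for the hash, so the partition numbers this sketch produces will not match what a real Kafka producer computes. The function name partition_for is hypothetical.

```python
import zlib


def partition_for(key, number_of_partitions):
    """Map a key to a partition: hash the key bytes, then take the modulo.

    zlib.crc32 is a stand-in hash chosen for this sketch; it is not the
    hash that Kafka's default partitioner uses.
    """
    key_bytes = key.encode("utf-8")
    return zlib.crc32(key_bytes) % number_of_partitions


first_call = partition_for("user123", 3)
second_call = partition_for("user123", 3)

# As long as the partition count is unchanged, a key always maps to the
# same partition, which is what keeps its messages in one ordered log.
print(first_call == second_call)  # True
```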

7.2. What Happens When You Add More Partitions Later

If your topic originally had 3 partitions, and you were sending messages with key "user123":

partition = hash("user123") % 3  --> Let's say this was partition 1

Now, if you increase the number of partitions to 5:

partition = hash("user123") % 5  --> This might now be partition 4

Result: Messages with the same key "user123" may now go to a different partition.

7.3. Impact on Ordering Guarantees

If per-key ordering matters (e.g., all messages for user123 must be processed in sequence), adding partitions can break that guarantee for every key whose partition changes.

Kafka does not reassign older messages to new partitions. So:

Older messages with key "user123" are in partition 1.
Newer messages go to partition 4 (after partition count increased).
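Consumers now read messages for "user123" from two different partitions. Kafka guarantees ordering only within a single partition, so the relative order of the older and newer messages is no longer guaranteed.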
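The effect can be reproduced with a small simulation, again under stated assumptions: zlib.crc32 stands in for Kafka's actual hash, the partition logs are plain in-memory lists, and the helper names and the 100 sample keys are made up for the illustration.

```python
import zlib
from collections import defaultdict


def partition_for(key, number_of_partitions):
    # zlib.crc32 is a stand-in hash for this sketch, not Kafka's actual hash.
    return zlib.crc32(key.encode("utf-8")) % number_of_partitions


def produce(logs, key, value, number_of_partitions):
    """Append (key, value) to the partition chosen for the key."""
    logs[partition_for(key, number_of_partitions)].append((key, value))


keys = [f"user{i}" for i in range(100)]
logs = defaultdict(list)   # partition number -> list of (key, value)

for key in keys:           # the topic has 3 partitions
    produce(logs, key, "event-1", 3)
for key in keys:           # the partition count has been raised to 5
    produce(logs, key, "event-2", 5)

# Keys whose partition changed when the count went from 3 to 5.
moved = [k for k in keys if partition_for(k, 3) != partition_for(k, 5)]

# For each moved key, the partitions that now hold part of its history.
split = {
    k: sorted(p for p, records in logs.items() if any(rk == k for rk, _ in records))
    for k in moved
}
print(f"{len(moved)} of {len(keys)} keys changed partition")
```

Every key listed in moved has "event-1" in one partition and "event-2" in another, so nothing ties the two events together in order anymore. Keys that happen to map to the same partition under both counts keep their full history in one log.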