Learnitweb

Kafka Producer Retry Mechanism and Configuration in Spring Boot

In a previous lesson, you learned how Kafka producers can be configured to wait for acknowledgments from brokers to ensure the durability and reliability of messages. This is particularly useful in distributed systems where message loss is unacceptable.

Now, let’s take it a step further by discussing what happens when a broker becomes unavailable, and how Kafka handles retries. We’ll also cover the most important configuration properties that control this behavior, especially in a Spring Boot application using Spring Kafka.

Understanding Kafka Producer Retries

When a Kafka producer sends a message, it typically waits for an acknowledgment from the Kafka broker indicating that the message was successfully received and stored in a partition.

Now, consider this scenario:

You’ve configured your Kafka topic with a replication factor of 3, and you’ve set the Kafka producer’s acknowledgment mode (acks) to all. This means that the message must be acknowledged not just by the leader, but also by all in-sync replicas (ISRs).

You’ve also configured the min.insync.replicas property to 3, ensuring that a message is only considered committed if all three replicas (the leader and two followers) are in sync and confirm the message storage.

But what happens if one of the followers becomes unavailable? The set of in-sync replicas shrinks to two, which is below min.insync.replicas, so the leader cannot accept the write and the producer receives a retriable "not enough replicas" error instead of an acknowledgment. This is where retries come in.
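To make the scenario concrete, here is a toy model of the check described above, assuming a write with acks=all is accepted only while the number of in-sync replicas is at least min.insync.replicas. The class and method names are made up for illustration; this is not Kafka's own code.

```java
// Toy model of the acks=all acceptance rule; names are hypothetical.
public class IsrCheck {

    /** Returns true if a write with acks=all can be accepted. */
    static boolean canAccept(int inSyncReplicas, int minInsyncReplicas) {
        return inSyncReplicas >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        int minInsyncReplicas = 3;
        // Leader and both followers are in sync.
        System.out.println(canAccept(3, minInsyncReplicas)); // true
        // One follower is down, so only two replicas are in sync.
        System.out.println(canAccept(2, minInsyncReplicas)); // false
    }
}
```

With min.insync.replicas lowered to 2, the second call would return true, which is why a replication factor of 3 is often paired with a value of 2 when writes must survive the loss of one replica.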

Error Handling in Kafka Producer

There are three possible scenarios when a Kafka producer sends a message:

  1. No acknowledgment is expected (when acks=0):
    In this case, the producer sends the message and moves on without waiting for any response. This is the fastest option but offers no delivery guarantees.
  2. Acknowledgment is received (when acks=1 or acks=all):
    The broker sends back a response indicating whether the message was stored successfully. If the response is positive, the producer proceeds.
  3. An error is encountered:
    • Non-retriable errors: These are permanent issues like message size exceeding the configured maximum. Retrying will not help in this case. The message is immediately marked as failed.
    • Retriable errors: These are temporary problems such as network glitches, broker unavailability, or too few in-sync replicas. The Kafka producer retries the send when it encounters one of these.
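The difference between the two error types can be sketched as a small retry loop. This is a simplified model, not the producer's real implementation: RetriableError and PermanentError are hypothetical stand-ins for Kafka's exception families, and the loop leaves out backoff and timeouts.

```java
// Simplified model of retriable vs. non-retriable handling; all names are hypothetical.
public class RetrySketch {

    static class RetriableError extends RuntimeException {
        RetriableError(String message) { super(message); }
    }

    static class PermanentError extends RuntimeException {
        PermanentError(String message) { super(message); }
    }

    /**
     * Runs send until it succeeds, fails permanently, or the retry
     * budget is used up. Returns the number of attempts made.
     */
    static int sendWithRetries(Runnable send, int retries) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                send.run();
                return attempts;
            } catch (RetriableError e) {
                // The first attempt is not a retry, so at most retries + 1 attempts run.
                if (attempts > retries) {
                    throw e;
                }
            }
            // PermanentError is not caught here, so it fails the send on first occurrence.
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated send that fails twice with a temporary error, then succeeds.
        int used = sendWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new RetriableError("not enough replicas");
            }
        }, 10);
        System.out.println("delivered after " + used + " attempts");
    }
}
```

In the real client, retriable failures are the exceptions that extend org.apache.kafka.common.errors.RetriableException, and the final outcome reaches your code through the send callback or the returned future.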

Kafka Producer Retry Configuration Properties

To manage these retries effectively, Kafka provides a set of producer configuration properties. Here are the most important ones, along with detailed explanations:

1. retries

This property controls the maximum number of retry attempts the producer should make when a retriable error occurs.

  • Default Value: Integer.MAX_VALUE (effectively retry indefinitely until the message is delivered or delivery timeout is reached)
  • Recommended Usage: Set a limit, such as 10, if you want to cap the number of attempts. If you leave the default, retrying is bounded by delivery.timeout.ms instead.
  • Use Case Example: If a broker goes down temporarily, Kafka can retry a few times before giving up.
spring.kafka.producer.retries=10

2. retry.backoff.ms

This property defines the time (in milliseconds) the producer should wait before attempting a retry. This delay prevents the producer from flooding the broker with retry requests in a tight loop.

  • Default Value: 100 ms
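  • Recommended Usage: Increase this value if your brokers or network are sensitive to load. For example, 1000 ms makes the producer wait 1 second between retries.
  • Spring Boot Note: Spring Boot has no dedicated property for this setting, so pass it through the generic spring.kafka.producer.properties.* prefix.
spring.kafka.producer.properties.retry.backoff.ms=1000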

3. delivery.timeout.ms

This is a key configuration that defines the total time the producer may spend delivering a single message, measured from the moment send() returns. It covers time spent waiting in a batch, time waiting for acknowledgments, all retries, and any backoff time between them.

  • Default Value: 120000 ms (2 minutes)
  • Important Note: This is the upper bound for all retries combined. If the message is not delivered within this time, a timeout exception is thrown.
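  • Best Practice: Kafka requires this value to be greater than or equal to linger.ms + request.timeout.ms. If you also cap retries, keep in mind that the worst case for all attempts is roughly (retries + 1) * request.timeout.ms + retries * retry.backoff.ms; when that exceeds delivery.timeout.ms, the delivery timeout ends retrying first.
  • Spring Boot Note: Spring Boot has no dedicated property for this setting, so pass it through the generic spring.kafka.producer.properties.* prefix.
spring.kafka.producer.properties.delivery.timeout.ms=120000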
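As a rough worked example of how these timeouts interact, the sketch below estimates how many full attempts fit inside the delivery timeout in the worst case. It assumes every attempt runs for the whole request.timeout.ms and that a fixed retry.backoff.ms pause separates attempts; the real producer is more nuanced, and the class and method names are hypothetical.

```java
// Rough worst-case estimate under the assumptions stated above; names are hypothetical.
public class DeliveryBudget {

    /**
     * Number of full attempts that fit in delivery.timeout.ms when each
     * attempt takes requestTimeoutMs and attempts are separated by
     * retryBackoffMs.
     */
    static long worstCaseAttempts(long deliveryTimeoutMs, long lingerMs,
                                  long requestTimeoutMs, long retryBackoffMs) {
        // Time left for sending once the batch has lingered.
        long budget = deliveryTimeoutMs - lingerMs;
        // n attempts cost n * request + (n - 1) * backoff, which must be <= budget.
        return (budget + retryBackoffMs) / (requestTimeoutMs + retryBackoffMs);
    }

    public static void main(String[] args) {
        // Values used as defaults in this lesson: 120000, 0, 30000, 100.
        System.out.println(worstCaseAttempts(120_000, 0, 30_000, 100)); // 3
    }
}
```

Under these assumptions only three full-length attempts fit into two minutes, so a setting such as retries=10 takes effect only when individual attempts fail faster than the request timeout.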

4. request.timeout.ms

This property controls how long the producer waits for a response from the Kafka broker after sending a request.

  • Default Value: 30000 ms (30 seconds)
  • Clarification: This is different from delivery.timeout.ms which covers the full duration of retries; this one is per individual attempt.
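  • Spring Boot Note: Spring Boot has no dedicated property for this setting, so pass it through the generic spring.kafka.producer.properties.* prefix.
spring.kafka.producer.properties.request.timeout.ms=30000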

5. linger.ms

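This property controls how long the producer waits for more records to join a batch before sending it. With a value of 0, the producer does not wait and sends records as soon as it can, although records that arrive faster than they can be sent are still batched together. The default is 0 in Kafka clients before 4.0 and 5 ms from 4.0 onward.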

  • Use Case: If you’re using compression, batching can significantly improve performance by compressing multiple messages at once.
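  • Effect on Delivery Timeout: Time spent lingering counts toward delivery.timeout.ms, so if you raise linger.ms, keep delivery.timeout.ms at or above linger.ms + request.timeout.ms.
  • Spring Boot Note: Spring Boot has no dedicated property for this setting, so pass it through the generic spring.kafka.producer.properties.* prefix.
spring.kafka.producer.properties.linger.ms=0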
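The five settings can also be supplied programmatically. The sketch below collects them in a plain map keyed by the Kafka property names and checks the constraint between the timeouts. ProducerRetryConfig and its methods are hypothetical names, and the values are the example values from this lesson rather than recommendations.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: class and method names are hypothetical.
public class ProducerRetryConfig {

    /** Collects the retry-related producer settings discussed in this lesson. */
    static Map<String, Object> retrySettings() {
        Map<String, Object> props = new HashMap<>();
        props.put("acks", "all");
        props.put("retries", 10);
        props.put("retry.backoff.ms", 1000);
        props.put("delivery.timeout.ms", 120000);
        props.put("request.timeout.ms", 30000);
        props.put("linger.ms", 0);
        return props;
    }

    /** Checks that delivery.timeout.ms >= linger.ms + request.timeout.ms. */
    static boolean isConsistent(Map<String, Object> props) {
        int delivery = (Integer) props.get("delivery.timeout.ms");
        int linger = (Integer) props.get("linger.ms");
        int request = (Integer) props.get("request.timeout.ms");
        return delivery >= linger + request;
    }

    public static void main(String[] args) {
        Map<String, Object> props = retrySettings();
        System.out.println("consistent: " + isConsistent(props)); // consistent: true
    }
}
```

A map like this, once bootstrap.servers and the serializers are added, is the kind of configuration you would hand to Spring Kafka's DefaultKafkaProducerFactory or directly to a KafkaProducer. The Kafka client performs a similar consistency check itself and rejects an explicitly set delivery.timeout.ms that is too small.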