When using Apache Kafka with Avro for message serialization, the Schema Registry plays a critical role in storing, validating, evolving, and retrieving the schemas of the data being sent and received. Here’s an explanation of its role, along with a practical example.
What is a Schema Registry?
A schema registry is a centralized service that stores and manages Avro schemas (definitions of the data structure). It ensures producers and consumers agree on what the structure of messages (fields, types, etc.) should be. This enables safe data exchange and schema evolution without breaking compatibility between different applications or services.
Key Responsibilities
- Schema Storage: Stores versions of all Avro schemas used for Kafka topics.
- Schema Validation: Checks whether new schemas are compatible with previous versions based on configurable compatibility rules (like backward, forward, or full compatibility).
- Schema Versioning: Maintains a history of schema changes (evolutions), enabling consumers to read both old and new versions of data.
- Schema ID Referencing: Instead of including the full schema with every Kafka message, only a small schema ID is sent (reducing message payload size). Consumers use this ID to fetch the actual schema from the Schema Registry, and typically cache it so that later messages with the same ID need no further lookup.
- Central “Contract”: Provides a consistent contract for producers and consumers, reducing serialization and deserialization errors, and enabling independent evolution of applications.
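To make these responsibilities concrete, here is a minimal in-memory sketch of the storage, versioning, and ID-lookup parts. It is a toy model, not the API of any real registry: the class name `ToySchemaRegistry`, its methods, and the subject name `transactions-value` are all assumptions made for illustration.

```python
import json


class ToySchemaRegistry:
    """In-memory stand-in for a schema registry (illustration only)."""

    def __init__(self):
        self._schemas_by_id = {}  # global schema ID -> schema dict
        self._ids_by_text = {}    # canonical schema text -> global schema ID
        self._versions = {}       # subject -> list of schema IDs, oldest first

    def register(self, subject, schema):
        """Store a schema under a subject and return its global ID."""
        text = json.dumps(schema, sort_keys=True)
        schema_id = self._ids_by_text.get(text)
        if schema_id is None:
            # First time this exact schema is seen: assign the next ID.
            schema_id = len(self._schemas_by_id) + 1
            self._ids_by_text[text] = schema_id
            self._schemas_by_id[schema_id] = schema
        versions = self._versions.setdefault(subject, [])
        if schema_id not in versions:
            versions.append(schema_id)
        return schema_id

    def get_by_id(self, schema_id):
        """Look up a schema by the ID carried in a message."""
        return self._schemas_by_id[schema_id]

    def versions(self, subject):
        """Return the 1-based version numbers registered for a subject."""
        return list(range(1, len(self._versions.get(subject, [])) + 1))


registry = ToySchemaRegistry()
payment_v1 = {
    "type": "record",
    "name": "Payment",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}
schema_id = registry.register("transactions-value", payment_v1)
```

Registering the same schema twice returns the same ID in this sketch, while a changed schema gets a new ID and a new version under the subject.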
Example – Using Schema Registry with Kafka and Avro
Suppose you are building a Kafka-based system that ingests and processes payment transactions:
- The producer (e.g., payments service) sends messages to the transactions topic, using Avro serialization.
- The consumer (e.g., reporting service) reads messages from the transactions topic, using Avro deserialization.
Step-by-step Example
1. Schema Registration
You define an Avro schema for a payment, e.g.:
{ "type": "record", "name": "Payment", "fields": [ {"name": "id", "type": "string"}, {"name": "amount", "type": "double"} ] }
The producer registers this schema with the Schema Registry. The registry returns a schema ID (say, 42).
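As a rough sketch of what registration can look like over HTTP, the snippet below builds (but does not send) a request in the style of the Confluent Schema Registry REST API. The base URL `http://localhost:8081`, the subject name `transactions-value`, and the helper `build_register_request` are assumptions for illustration; in practice the serializer library usually performs this step for you.

```python
import json
import urllib.request

PAYMENT_SCHEMA = {
    "type": "record",
    "name": "Payment",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}


def build_register_request(base_url, subject, schema):
    """Build (but do not send) a schema registration request.

    The schema travels as a JSON *string* inside the JSON request body.
    """
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/subjects/{subject}/versions",
        data=body,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )


# Hypothetical registry address and subject name.
request = build_register_request(
    "http://localhost:8081", "transactions-value", PAYMENT_SCHEMA
)
```

Sending the request (for example with `urllib.request.urlopen(request)`) against a running registry should return a small JSON body containing the assigned schema ID.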
2. Producing Messages
- When the producer serializes a message, it attaches schema ID 42 (not the whole schema) to each record.
- The actual Kafka message contains the Avro-encoded payload and the schema ID.
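The sketch below shows one way the ID and payload can be laid out in the message bytes. It assumes the framing used by Confluent's serializers (a zero "magic" byte, then the schema ID as a 4-byte big-endian integer, then the Avro binary payload) and hand-encodes only the two field types of the Payment schema. The function names are illustrative; a real producer would rely on a serializer library rather than code like this.

```python
import struct


def encode_long(value):
    """Avro long: zigzag-encode, then emit as a base-128 varint."""
    zigzag = (value << 1) ^ (value >> 63)
    out = bytearray()
    while zigzag & ~0x7F:
        out.append((zigzag & 0x7F) | 0x80)
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def encode_string(text):
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data


def encode_double(value):
    """Avro double: 8 bytes, little-endian IEEE 754."""
    return struct.pack("<d", value)


def encode_payment(payment):
    """Fields are written in schema order, with no names or separators."""
    return encode_string(payment["id"]) + encode_double(payment["amount"])


def frame(schema_id, payload):
    """Prefix the payload with the assumed 5-byte header."""
    return struct.pack(">bI", 0, schema_id) + payload


message = frame(42, encode_payment({"id": "p-1", "amount": 9.5}))
```

Because field names and types live only in the schema, the payload holds just the values, which is why the consumer cannot decode it without looking up schema 42.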
3. Schema Evolution
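- Later, you might add a field, say customer_id, to the Payment record. Under backward compatibility, an added field needs a default value so that consumers using the new schema can still read records written with the old one.
- The new schema is registered as a new version in the Schema Registry. The registry checks compatibility with previous versions.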
- Messages with the old schema and new schema can coexist in the same topic.
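The compatibility check in this step can be illustrated with a deliberately simplified rule: a reader on the new schema can decode old records only if every field it adds carries a default. The function `is_backward_compatible` and the schema variables below are assumptions for this sketch; a real registry also checks type changes, removed fields, unions, aliases, and more.

```python
PAYMENT_V1 = {
    "type": "record",
    "name": "Payment",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# Adds customer_id as an optional field with a default: old records still decode.
PAYMENT_V2 = {
    "type": "record",
    "name": "Payment",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "customer_id", "type": ["null", "string"], "default": None},
    ],
}

# Adds customer_id without a default: old records have no value to fill in.
PAYMENT_V2_NO_DEFAULT = {
    "type": "record",
    "name": "Payment",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "customer_id", "type": "string"},
    ],
}


def is_backward_compatible(old_schema, new_schema):
    """Simplified check covering added record fields only."""
    old_names = {field["name"] for field in old_schema["fields"]}
    for field in new_schema["fields"]:
        if field["name"] not in old_names and "default" not in field:
            return False
    return True


accepted = is_backward_compatible(PAYMENT_V1, PAYMENT_V2)
rejected = not is_backward_compatible(PAYMENT_V1, PAYMENT_V2_NO_DEFAULT)
```

Under this rule the first evolution is accepted and the second is rejected, which mirrors how a registry configured for backward compatibility would refuse a schema that breaks readers of existing data.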
4. Consuming Messages
- The consumer reads a message, sees schema ID 42, and requests the corresponding schema from the registry.
- The consumer deserializes the Avro payload using the retrieved schema, ensuring it always knows the exact record structure.
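The consumer side can be sketched in the same spirit. The code assumes Confluent-style framing (a zero magic byte followed by a 4-byte big-endian schema ID) and uses a plain dictionary, `SCHEMA_CACHE`, as a stand-in for the registry lookup plus local cache. The decoder handles only the string and double types of the Payment schema, and all names are illustrative.

```python
import struct

# Stand-in for "fetch schema by ID from the registry, then cache it".
SCHEMA_CACHE = {
    42: {
        "type": "record",
        "name": "Payment",
        "fields": [
            {"name": "id", "type": "string"},
            {"name": "amount", "type": "double"},
        ],
    }
}


def decode_long(buf, pos):
    """Read a zigzag varint; return (value, next position)."""
    shift = 0
    raw = 0
    while True:
        byte = buf[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1), pos


def decode_record(schema, buf, pos):
    """Walk the schema's fields in order, reading one value per field."""
    record = {}
    for field in schema["fields"]:
        if field["type"] == "string":
            length, pos = decode_long(buf, pos)
            record[field["name"]] = buf[pos:pos + length].decode("utf-8")
            pos += length
        elif field["type"] == "double":
            record[field["name"]] = struct.unpack_from("<d", buf, pos)[0]
            pos += 8
        else:
            raise NotImplementedError(f"type not handled in this sketch: {field['type']}")
    return record


def deserialize(message):
    """Split off the assumed 5-byte header, resolve the schema, decode the rest."""
    magic, schema_id = struct.unpack_from(">bI", message, 0)
    if magic != 0:
        raise ValueError("unexpected magic byte")
    schema = SCHEMA_CACHE[schema_id]
    return decode_record(schema, message, 5)


# A hand-built message: header for schema 42, then "p-1" and 9.5.
message = b"\x00" + struct.pack(">I", 42) + b"\x06p-1" + struct.pack("<d", 9.5)
payment = deserialize(message)
```

The schema drives the decoding loop entirely: without it, the bytes after the header are just an unlabeled sequence of values.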
Why Not Embed Schemas Directly?
While Avro supports embedding a schema in every message, doing so wastes space and isn’t practical for large or high-volume systems. The schema registry instead stores schemas centrally and references them by ID in each Kafka message, keeping payloads small and ensuring consistent schema evolution.
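A back-of-the-envelope comparison makes the space argument tangible. The sketch assumes a 5-byte ID prefix per message (one magic byte plus a 4-byte ID, as in Confluent-style framing) and measures the Payment schema as compact JSON; the message count is an arbitrary illustration, not a measurement.

```python
import json

PAYMENT_SCHEMA = {
    "type": "record",
    "name": "Payment",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# Size of the schema if it were shipped as compact JSON with every message.
SCHEMA_BYTES = len(json.dumps(PAYMENT_SCHEMA, separators=(",", ":")).encode("utf-8"))

# Assumed per-message cost of referencing the schema by ID instead.
ID_PREFIX_BYTES = 5


def total_overhead(message_count, per_message_bytes):
    """Schema-related bytes added across a batch of messages."""
    return message_count * per_message_bytes


MESSAGES = 1_000_000  # arbitrary illustrative volume
embedded_overhead = total_overhead(MESSAGES, SCHEMA_BYTES)
id_overhead = total_overhead(MESSAGES, ID_PREFIX_BYTES)

print(f"embedded schema: {embedded_overhead} bytes, schema ID: {id_overhead} bytes")
```

Even for this two-field schema the embedded variant costs many times more than the ID prefix, and the gap widens as schemas grow.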