Kafka Connect is a free, open-source component of Apache Kafka® that serves as a centralized data hub for simple data integration between databases, key-value stores, search indexes, and file systems. You can use Kafka Connect to stream data between Apache Kafka® and other data systems and quickly create connectors that move large data sets in and out of Kafka.
In a typical enterprise, multiple independent applications operate simultaneously. Some of these are custom-built in-house, while others are acquired from third-party vendors. Additionally, certain applications might be hosted externally, managed by partners or service providers. Each of these systems generates and owns its own data. However, they often require additional data maintained by other systems. For instance, financial accounting software depends on data from an invoicing system, while an inventory management system relies on information from invoicing, warehousing, and shipment systems.
Enterprise-wide analytics further necessitate data aggregation from various sources. As a result, data integration becomes a significant challenge. Initially, attempts to address this led to a growing network of interconnected data pipelines, which quickly became unmanageable. To tackle this complexity, LinkedIn developed Kafka and leveraged it to streamline its data pipelines.
Let us think about a problem: moving data from an application to Snowflake. You introduce a Kafka Broker in between. Moving the data to the Kafka Broker is a one-time activity; once the data is in the Kafka Broker, you can move it to as many other applications as you need.
Now, how do you move the data from the application to the Kafka Broker?
If you have access to the source code of the application and modifying it is practically feasible, you can integrate an embedded Kafka producer using Kafka producer APIs. This embedded producer operates within your application, running internally to send invoices to the Kafka Cluster.
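As an illustration, here is a minimal sketch of such an embedded producer written with the Java Producer API. The broker address, topic name, and record contents are placeholders, not taken from any particular application:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class InvoiceProducer {
    public static void main(String[] args) {
        // Minimal producer configuration; the broker address is a placeholder.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // In a real application, the record would be built from the invoice object
            // and send failures would be handled via the returned Future or a callback.
            producer.send(new ProducerRecord<>("invoices", "INV-1001", "{\"amount\": 250.0}"));
        }
    }
}
```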

What if you lack access to your application’s source code or modifying it isn’t feasible? How do you send data?
A solution is to create an independent Kafka producer that connects to the source database, retrieves data, and transmits it to the Kafka cluster.
Both options work—it’s your choice to evaluate and decide.
However, if you opt for an independent producer, you might be reinventing the wheel for a problem already solved.
Kafka Connect acts as a bridge between your data source and Kafka Cluster.
Simply configure it to consume data from the source and send it to Kafka—no coding required.
Everything is pre-built; just set it up and let Kafka Connect handle the rest.
To transfer data from the Kafka Cluster to Snowflake, use Kafka Connect again. Set it up between the two systems and configure it; it will seamlessly handle reading from the Kafka Cluster and writing to the Snowflake database, with no coding required.

The connector on the source side is called a Kafka Connect Source Connector, and the connector on the destination side is called a Kafka Connect Sink Connector.
We use the Source Connector to pull data from a source system and send it to the Kafka Cluster; internally, the Source Connector uses the Kafka Producer API. Similarly, we use the Sink Connector to consume data from a Kafka topic and sink it to an external system; internally, Sink Connectors use the Kafka Consumer API.
Kafka Connect Framework
The Kafka Connect framework allows you to write connectors. Connectors come in two flavors:
- Source Connector
- Sink Connector
The Kafka Connect framework handles the complexities of scalability, fault tolerance, error handling, and more, allowing you to focus on building connectors with minimal effort.
As a connector developer, you only need to implement two Java classes:
- SourceConnector or SinkConnector – Defines the connector’s behavior.
- SourceTask or SinkTask – Handles the actual data movement.
With these, Kafka Connect efficiently manages data integration between Kafka and external systems.
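To make that division of responsibilities concrete, here is a minimal, hypothetical sketch against the Connect API. The class names and the data produced are invented for illustration; a real connector would validate its configuration and split the work meaningfully across tasks:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

/** Hypothetical connector: defines behaviour and hands configuration to its tasks. */
public class SampleSourceConnector extends SourceConnector {
    private Map<String, String> props;

    @Override public String version() { return "0.1.0"; }

    @Override public void start(Map<String, String> props) {
        // Store the user-supplied configuration; a real connector validates it here.
        this.props = props;
    }

    @Override public Class<? extends Task> taskClass() { return SampleSourceTask.class; }

    @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Decide how to parallelize the work; here every task gets the same configuration.
        return Collections.nCopies(Math.max(1, maxTasks), props);
    }

    @Override public void stop() { }

    @Override public ConfigDef config() { return new ConfigDef(); }
}

/** Hypothetical task: polls the external system and returns records to the worker. */
class SampleSourceTask extends SourceTask {
    @Override public String version() { return "0.1.0"; }

    @Override public void start(Map<String, String> props) {
        // Open connections to the source system here.
    }

    @Override public List<SourceRecord> poll() throws InterruptedException {
        // Fetch data from the source; the worker writes the returned records to Kafka.
        SourceRecord record = new SourceRecord(
                Collections.singletonMap("source", "demo"),  // source partition
                Collections.singletonMap("offset", 0L),      // source offset
                "demo-topic", Schema.STRING_SCHEMA, "hello");
        return Collections.singletonList(record);
    }

    @Override public void stop() { }
}
```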
Suppose you need to transfer data from an RDBMS to a Kafka Cluster. Simply choose a suitable source connector, such as the JDBC Source Connector. Install it in Kafka Connect, configure it, and run it—no additional coding required!
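For example, a configuration for the JDBC Source Connector might look roughly like the sketch below. The property names follow the JDBC connector’s documentation, but the connection details, table, and topic prefix are placeholders for your environment:

```properties
name=jdbc-invoice-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=2
connection.url=jdbc:mysql://localhost:3306/sales
connection.user=connect_user
connection.password=secret
table.whitelist=invoices
mode=incrementing
incrementing.column.name=invoice_id
topic.prefix=mysql-
poll.interval.ms=5000
```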
To move data from your Kafka Cluster to a Snowflake database, simply use the Snowflake Sink Connector. Install it in Kafka Connect, configure it, run it— and you’re all set!
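A Snowflake Sink Connector configuration follows the same pattern. The property names below reflect the Snowflake connector’s documentation and may differ between versions; all values are placeholders:

```properties
name=snowflake-invoice-sink
connector.class=com.snowflake.kafka.connector.SnowflakeSinkConnector
tasks.max=2
topics=invoices
snowflake.url.name=myaccount.snowflakecomputing.com:443
snowflake.user.name=connect_user
snowflake.private.key=<private-key>
snowflake.database.name=SALES_DB
snowflake.schema.name=PUBLIC
```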
Scaling Kafka Connect
Kafka Connect operates as a cluster, with each individual unit known as a Connect Worker. Think of it as a network of computers, each running a Kafka Connect Worker, working together to manage data flow.
Multiple SourceTasks can run in parallel, distributing the workload. One task may fetch data from one database table, while another pulls from a different table.
Similarly, multiple SinkTasks can run concurrently to handle data processing. Kafka Connect is highly scalable—adjust the number of tasks and add more workers to increase capacity. You can fine-tune configurations based on your use case and system design.
You can have one Kafka Connect Cluster and run as many connectors as you want.
If your Kafka Connect Cluster reaches full capacity, you can scale it effortlessly by adding more workers to the same cluster.
Best of all, this can be done dynamically, without interrupting any running connectors, ensuring seamless data processing.
Kafka Connect Transformations
Kafka Connect is designed for seamless data movement between third-party systems and Kafka, acting as a plain copy mechanism for data transfer.
In both Source and Sink connectors, one side must always be a Kafka Cluster.
However, Kafka Connect also supports Single Message Transformations (SMTs), allowing you to modify each message on the fly before it reaches its destination.
These transformations are available for both Source and Sink connectors. Following are some single message transformations (a sample configuration sketch follows the list):
- Add a new field in your record using static data or metadata.
- Filter or rename fields.
- Mask some fields with a null value.
- Change the record key.
- Route the record to a different Kafka Topic.
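As a sketch of how SMTs are wired into a connector’s configuration, the snippet below chains three of Apache Kafka’s built-in transformations; the transform aliases and field names are made up for illustration:

```properties
transforms=addSource,maskCard,reroute
# Add a static field to every record value.
transforms.addSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addSource.static.field=source_system
transforms.addSource.static.value=invoicing
# Mask a sensitive field with a null value.
transforms.maskCard.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.maskCard.fields=card_number
# Route records to a different topic based on a regex over the topic name.
transforms.reroute.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.reroute.regex=mysql-(.*)
transforms.reroute.replacement=invoices-$1
```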
Kafka Connect Architecture
To understand Kafka Connect Architecture, you need to understand three things:
- Worker
- Connector
- Task
Kafka Connect runs as a cluster of one or more workers. These workers are fault-tolerant, and they use a Group ID to form a cluster; the Group ID mechanism is the same one used by Kafka Consumer Groups. Simply start multiple workers with the same group ID, and they will automatically collaborate to form a Kafka Connect Cluster, ensuring seamless scalability and distributed processing.
These workers function like container processes, responsible for starting and running connectors and tasks. The workers are fault-tolerant and self-managed.
If a worker in the Connect Cluster stops or crashes, other workers will detect the failure and automatically reassign the connectors and tasks from the affected worker to the remaining ones.
Similarly, if a new worker joins the cluster, the existing workers will recognize it and redistribute connectors and tasks to maintain a balanced load across the cluster.
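For reference, here is a minimal sketch of a distributed worker configuration; every worker started with the same group.id joins the same Connect Cluster. The broker address and internal topic names are placeholders:

```properties
bootstrap.servers=localhost:9092
group.id=connect-cluster-1
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Internal topics the workers use to share connector configs, offsets, and status.
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
```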
Let’s say we want to copy data from a relational database. First, we’d download the JDBC Source Connector and install it within the cluster. The installation process simply ensures that the JAR files and all dependencies are available to the workers. Next, we’ll configure the connector by providing necessary information, such as database connection details, a list of tables to copy, how often to pull data, the maximum number of tasks, and other configuration parameters based on the connector’s requirements. This configuration is placed in a file, and we can start the connector using a command-line tool or even via Kafka Connect’s REST API.
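As a sketch of the REST option: the settings shown earlier are wrapped in a JSON document with a name field and a config object, and posted to any worker’s REST endpoint (port 8083 by default). The file name here is hypothetical:

```bash
# Submit the connector configuration ({"name": ..., "config": {...}}) to a Connect worker.
curl -X POST -H "Content-Type: application/json" \
     --data @jdbc-invoice-source.json \
     http://localhost:8083/connectors
```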
The connector itself doesn’t directly copy the data; instead, it splits the work into a list of tasks. Each task is configured to read data from a specific table and includes additional configuration such as database connection details. These tasks are then handed over to the workers, who distribute them to balance the cluster’s load.
The task’s role is to connect to the source system, poll data at regular intervals, collect the records, and pass them to the worker. The worker, in turn, sends the data to the Kafka Cluster. Sink tasks work the other way around: they receive data from the Kafka Cluster and are responsible for inserting the records into the target system.
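For the sink side, a hypothetical SinkTask sketch looks like this; the worker hands it batches of records it has consumed from Kafka, and the task writes them to the target system:

```java
import java.util.Collection;
import java.util.Map;

import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

/** Hypothetical sink task: receives records from the worker and writes them out. */
public class SampleSinkTask extends SinkTask {
    @Override public String version() { return "0.1.0"; }

    @Override public void start(Map<String, String> props) {
        // Open a connection to the target system here.
    }

    @Override public void put(Collection<SinkRecord> records) {
        // Write each record to the target system; offset commits are handled by the framework.
        for (SinkRecord record : records) {
            System.out.printf("Would insert value %s from topic %s%n",
                    record.value(), record.topic());
        }
    }

    @Override public void stop() { }
}
```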
This design follows a reusable pattern:
- The connector class handles how to split data for parallel processing.
- The task class handles how to interact with the external system.
Most other activities—like interacting with Kafka, handling configurations, monitoring, scaling, and failure management—are standard and managed by the Kafka Connect framework itself.
This separation of concerns allows the framework to handle common operations, letting the connector developer focus on specific system interactions and parallelism logic.