Learnitweb

Data Manipulation Language (DML) in ClickHouse

Data Manipulation Language, commonly abbreviated as DML, refers to the family of SQL commands that are responsible for inserting, modifying, and removing data stored inside database tables, and in ClickHouse this topic requires special attention because ClickHouse is a columnar analytical database that is optimized primarily for high-volume reads and aggregations rather than frequent row-level modifications. As a result, while DML exists and is fully usable in ClickHouse, it behaves differently from traditional OLTP databases, and understanding these differences is essential for designing efficient systems.

DML commands operate on the data itself rather than on the structure of the database, which distinguishes them from DDL. Whenever you add new records, correct existing values, or remove unwanted data, you are using DML. In analytical workloads, inserts are extremely common, whereas updates and deletes are typically used more carefully due to their cost.

The most commonly used DML operations in ClickHouse are:

  • INSERT
  • UPDATE (implemented as mutations)
  • DELETE (implemented as mutations)
  • Lightweight DELETE

Each of these has specific performance and architectural implications.

1. INSERT — Adding Data to Tables

INSERT is the primary and most performance-friendly DML operation in ClickHouse because the system is designed for fast data ingestion. Analytical databases often ingest large batches of data from logs, streams, or ETL pipelines, and ClickHouse excels at this pattern.

Before inserting data, let us consider a simple table based on the MergeTree engine family, which is required for advanced modifications.

Example table:

CREATE TABLE dml_demo.sample_table
(
    column1 String,
    column2 String
)
ENGINE = MergeTree
ORDER BY column1;

Now, an INSERT statement looks like this:

INSERT INTO dml_demo.sample_table (column1, column2)
VALUES
    ('1', 'A'),
    ('2', 'B');

This statement:

  • Specifies the target table
  • Lists the columns receiving data
  • Provides values for each row

After execution, selecting from the table will show the inserted rows.

In real systems, INSERT often happens in bulk because ClickHouse performs best when data is written in larger batches rather than one row at a time.

2. UPDATE — Modifying Existing Data (Mutations)

ClickHouse does support updates, but they are not simple in-place row updates like in traditional databases. Instead, updates are implemented as mutations, which means the system rewrites data parts behind the scenes.

Important constraints:

  • Updates are supported only for MergeTree family tables
  • They are asynchronous and resource-intensive
  • Frequent updates can increase CPU and disk usage

Syntax example:

ALTER TABLE dml_demo.sample_table
UPDATE column2 = 'Apple'
WHERE column2 = 'A';

What happens internally is not a direct overwrite of a single row. Instead, ClickHouse marks data parts for mutation and rewrites them, which is why this operation is considered expensive.

Because of this design:

  • Updates should be rare
  • Tables should be modeled to minimize corrections
  • Data is often treated as append-only in analytics

3. DELETE — Removing Data (Mutations)

Deletes in ClickHouse are also implemented as mutations and follow a similar cost model as updates.

Example:

ALTER TABLE dml_demo.sample_table
DELETE
WHERE column2 = 'Apple';

This command:

  • Identifies rows using the WHERE clause
  • Marks data for deletion via mutation
  • Triggers background rewriting of data parts

After completion, the matching rows disappear from query results.

However, the cost considerations remain the same:

  • Large deletes are expensive
  • Frequent deletes increase system load
  • Analytical schemas usually avoid row-by-row deletions

4. Lightweight DELETE — A Newer Alternative

Lightweight deletes were introduced to provide a more efficient way to logically remove rows without immediately rewriting data parts.

Key idea:

Rows are masked first and physically removed later.

When a lightweight delete runs, ClickHouse does not instantly remove the row. Instead, it uses a hidden column (commonly referred to as _row_exists) to mark whether a row is valid.

If a row is “deleted”:

  • _row_exists is marked false
  • Queries no longer return the row
  • Physical removal happens during background merges

Syntax:

DELETE FROM dml_demo.sample_table
WHERE column2 = 'B';

This operation:

  • Is available only for MergeTree tables
  • Is cheaper than full mutations
  • Defers physical cleanup to merge processes

After running this, querying the table will show no rows if the deleted rows were the only ones present.

5. Why Mutations Are Costly in ClickHouse

ClickHouse stores data in immutable parts for performance and compression efficiency. This design is excellent for reads but means that modifying data requires rewriting parts rather than editing individual rows.

Thus:

  • INSERT is cheap and fast
  • UPDATE and DELETE require part rewrites
  • Frequent mutations increase hardware usage

Because of this, many ClickHouse deployments follow an append-only strategy, where corrections are handled through versioning or re-ingestion rather than updates.