Learnitweb

Elasticsearch Segments

In this tutorial, we are going to talk about segments. This is one of those internal concepts that you do not strictly need to know to use Elasticsearch effectively, but understanding it gives you a much clearer picture of why Elasticsearch behaves the way it does, especially when it comes to performance, updates, and deletions.

Many common questions, such as “Will the inverted index become too large?” or “Do frequent updates and deletes hurt performance?”, are answered directly by understanding segments.

The Core Problem Segments Solve

Let us start with a simple concern.

  • We have a products index.
  • Elasticsearch builds an inverted index for fast lookups.
  • Over time, we keep inserting millions of documents.
  • We also perform updates and deletes, because Elasticsearch is often used like a database.

Naturally, you might ask:

  • Will the inverted index become too big?
  • Will frequent updates and deletes slow everything down because the inverted index keeps changing?

The answer lies in how Apache Lucene stores data internally.

What Is a Segment?

From Lucene’s perspective, a segment is the fundamental unit of storage and indexing.

A segment is:

  • Self-contained
  • Immutable
  • Effectively a mini index

This means:

  • Each segment has its own inverted index
  • Once a segment is written, it is never modified

This idea of immutability is the key to understanding Elasticsearch performance.
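
To make the three properties concrete, here is a minimal Python sketch of a segment as a read-only object that carries its own inverted index. It is a toy model, not Lucene's actual on-disk format; the names Segment and build_segment are invented for this illustration.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Tuple


@dataclass(frozen=True)
class Segment:
    """Toy stand-in for a segment: stored documents plus their own inverted index."""
    docs: Mapping[int, str]                  # doc id -> original text
    postings: Mapping[str, Tuple[int, ...]]  # term -> sorted ids of docs containing it


def build_segment(docs: dict) -> Segment:
    """Build the inverted index once; the returned segment is read-only."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    postings = {term: tuple(sorted(ids)) for term, ids in index.items()}
    # MappingProxyType gives a read-only view, mimicking immutability.
    return Segment(MappingProxyType(dict(docs)), MappingProxyType(postings))


segment = build_segment({1: "Apple iPhone", 2: "Apple MacBook", 3: "Samsung Galaxy"})
print(segment.postings["apple"])  # (1, 2)
```

Any attempt to assign into segment.postings raises a TypeError, which mirrors the rule that a written segment is never modified.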

Visualizing Segments in an Index

Imagine we create a products index and start adding documents.

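Behind the scenes, Elasticsearch (via Lucene) does not keep appending everything into one massive inverted index. Instead, it creates multiple segments. The names below are illustrative labels only; on disk, Lucene names segments with a short counter such as _0, _1, _2: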

  • products_0
  • products_1
  • products_2
  • products_n

Each of these is a segment, and each segment behaves like a mini independent index.

Even though you query the products index as a single logical unit, internally each shard of that index is a Lucene index made up of multiple segments, and a search consults all of them.

How Documents Become Segments

When you insert documents:

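  1. New documents are first written to an in-memory buffer.
  2. Periodically, based on time and memory thresholds, this buffer is written out. In Elasticsearch this step is called a refresh, and by default it runs about once per second.
  3. The refresh creates a new segment and makes its documents searchable. The segment initially lives in the filesystem cache; a later flush (a Lucene commit) persists it durably to disk.
  4. The segment includes the inverted index (term dictionary + posting lists) for exactly those documents.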

You can think of each segment as a small set of files on disk containing:

  • The document data
  • The inverted index for just those documents

This design makes writes extremely fast, because Lucene is mostly doing sequential writes and never modifying existing files.
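
The buffer-to-segment step can be sketched as follows. This is a simplified model under stated assumptions: the class ToyIndex, the method write_segment, and the max_buffered_docs threshold are hypothetical names for this illustration, and real Elasticsearch triggers this step by time and memory settings rather than a document count.

```python
class ToyIndex:
    """Toy model: an in-memory buffer plus a list of already-written segments."""

    def __init__(self, max_buffered_docs=2):
        self.max_buffered_docs = max_buffered_docs  # hypothetical threshold
        self.buffer = {}    # doc id -> text, not yet part of any segment
        self.segments = []  # each entry: {"docs": ..., "postings": ...}

    def index(self, doc_id, text):
        self.buffer[doc_id] = text
        if len(self.buffer) >= self.max_buffered_docs:
            self.write_segment()

    def write_segment(self):
        """Turn the buffer into a new segment; existing segments are untouched."""
        if not self.buffer:
            return
        postings = {}
        for doc_id, text in self.buffer.items():
            for term in set(text.lower().split()):
                postings.setdefault(term, []).append(doc_id)
        # Append-only: a new segment is added, nothing is rewritten.
        self.segments.append({"docs": dict(self.buffer), "postings": postings})
        self.buffer = {}


index = ToyIndex()
for doc_id, text in enumerate(["Apple iPhone", "Apple MacBook", "Samsung Galaxy"]):
    index.index(doc_id, text)
index.write_segment()       # write out whatever is still buffered
print(len(index.segments))  # 2
```

Note that indexing only ever appends to self.segments, which is the sequential, never-modify behaviour described above.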

Why Segments Are Immutable

The statement “segments are immutable” often causes confusion.

It simply means:

  • Once a segment is written, it is never changed
  • No documents are updated inside a segment
  • No documents are physically removed from a segment

This immutability is intentional and provides excellent write performance.

Trade-off: Fast Writes, Slower Searches

This design has a clear trade-off.

Advantage: Very Fast Writes

  • Elasticsearch does not need to rebalance or rewrite a huge inverted index.
  • It just creates a new segment and moves on.
  • This is why Elasticsearch handles high ingestion rates so well.

Disadvantage: Search Must Check All Segments

If your index has many segments and a user searches for “Apple”:

  • Elasticsearch must check every segment
  • Any segment could potentially contain documents with the term Apple
  • Results from all segments must be merged and consolidated

This can slow down search as the number of segments grows.
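
A rough sketch of why the segment count matters: in the toy model below, each segment is just a dictionary of posting lists, and a search has to visit every one of them. The data and the function name search are made up for illustration.

```python
# Three toy segments, each with its own inverted index (term -> doc ids).
segments = [
    {"apple": [1, 2], "iphone": [1], "macbook": [2]},
    {"samsung": [3], "galaxy": [3]},
    {"apple": [4], "watch": [4]},
]


def search(term, segments):
    """Look the term up in every segment and combine the hits."""
    hits = []
    for postings in segments:  # cost grows with the number of segments
        hits.extend(postings.get(term.lower(), []))
    return sorted(hits)


print(search("Apple", segments))  # [1, 2, 4]
```

The second segment contributes nothing to this query, but it still had to be checked.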

Segment Merging: The Solution

To balance this trade-off, Lucene uses a process called segment merging.

Periodically:

  • Many small segments are merged into fewer, larger segments
  • During the merge:
    • Document data is rewritten
    • Inverted indexes are merged
    • Deleted documents are physically removed

After merging:

  • There are fewer segments
  • Search becomes faster because fewer segments need to be queried
  • Larger segments are more efficient for lookups

This entire process is handled automatically by Lucene. As users of Elasticsearch, we do not have to manage this ourselves.
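
As an illustration only, a merge can be modelled as building one new segment from the live documents of several old ones. The segment layout (a dict with "docs" and a "deleted" set) and the function names are assumptions of this sketch; Lucene's real merge policy also decides which segments to merge and when, which is not modelled here.

```python
def build_postings(docs):
    """Build term -> doc ids for a set of documents."""
    postings = {}
    for doc_id, text in sorted(docs.items()):
        for term in set(text.lower().split()):
            postings.setdefault(term, []).append(doc_id)
    return postings


def merge(segments):
    """Write one new segment from several old ones, skipping deleted documents."""
    live_docs = {}
    for seg in segments:
        for doc_id, text in seg["docs"].items():
            if doc_id not in seg["deleted"]:
                live_docs[doc_id] = text
    # The old segments are not edited; a fresh segment replaces them.
    return {"docs": live_docs, "postings": build_postings(live_docs), "deleted": set()}


small_segments = [
    {"docs": {1: "Apple iPhone", 2: "Apple MacBook"}, "deleted": {2}},
    {"docs": {3: "Samsung Galaxy"}, "deleted": set()},
    {"docs": {4: "Apple Watch"}, "deleted": set()},
]
merged = merge(small_segments)
print(sorted(merged["docs"]))       # [1, 3, 4]
print(merged["postings"]["apple"])  # [1, 4]
```

Document 2 was marked deleted in its original segment, so it does not appear in the merged segment at all.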

How Deletions Work with Immutable Segments

A natural question arises:

If segments are immutable, how does document deletion work?

The answer is deletion markers.

Step-by-Step Deletion Example

  1. Suppose product ID 10 exists in a segment and contains "Apple iPhone".
  2. You send a delete request for product ID 10.
  3. Lucene cannot modify the segment, so it does not remove the document.
  4. Instead, it writes a deletion marker in a separate internal structure saying:
    • Document ID 10 is deleted

During Search

  • A segment may still return product ID 10 as a match.
  • Lucene checks the deletion markers.
  • Since document 10 is marked deleted, it is filtered out from the final results.

During Segment Merge

  • When segments are merged later:
    • Deleted documents are completely removed
    • The new merged segment no longer contains them at all
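
The search-time filtering can be sketched like this. It is a toy model: the deletion markers are a plain Python set kept next to the segment, and the names delete and search are hypothetical.

```python
# The segment itself: written once, never changed afterwards.
segment = {
    "docs": {10: "Apple iPhone", 11: "Apple MacBook"},
    "postings": {"apple": [10, 11], "iphone": [10], "macbook": [11]},
}
deleted = set()  # deletion markers live outside the segment


def delete(doc_id):
    """Record a deletion marker instead of touching the segment."""
    deleted.add(doc_id)


def search(term):
    """Match against the segment, then filter out documents marked as deleted."""
    candidates = segment["postings"].get(term.lower(), [])
    return [doc_id for doc_id in candidates if doc_id not in deleted]


print(search("Apple"))               # [10, 11]
delete(10)
print(search("Apple"))               # [11]
print(segment["postings"]["apple"])  # [10, 11]  (segment is unchanged)
```

The posting list still contains document 10 after the delete; only the final result changes.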

How Updates Work: Delete + Insert

There is no true update inside a segment.

An update is implemented as:

  1. Delete the old version of the document (via a deletion marker)
  2. Insert the new version of the document into a new segment

Using the same example:

  • Product ID 10 is updated
  • Lucene marks the old document as deleted
  • A new document with ID 10 and updated data is added to a newer segment

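Searches see only the new version right away, because the old one is filtered out by its deletion marker. Later, during merging, the old version is physically removed and only the updated version remains on disk.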
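
The two steps can be sketched in a few lines. This is a simplified illustration: the helper names update and get are hypothetical, and the new version is written straight into a new segment, skipping the in-memory buffer for brevity.

```python
# One existing segment holding the original version of product 10.
segments = [
    {"docs": {10: "Apple iPhone 14"}, "deleted": set()},
]


def update(doc_id, new_text):
    """Update = mark the old version deleted, then add the new version elsewhere."""
    for seg in segments:
        if doc_id in seg["docs"]:
            seg["deleted"].add(doc_id)  # step 1: deletion marker for the old version
    # Step 2: the new version goes into a new segment (simplification: no buffer).
    segments.append({"docs": {doc_id: new_text}, "deleted": set()})


def get(doc_id):
    """Return the live version of a document, ignoring versions marked deleted."""
    for seg in segments:
        if doc_id in seg["docs"] and doc_id not in seg["deleted"]:
            return seg["docs"][doc_id]
    return None


update(10, "Apple iPhone 15")
print(len(segments))            # 2
print(get(10))                  # Apple iPhone 15
print(segments[0]["docs"][10])  # Apple iPhone 14  (old text still stored)
```

The old text stays in the first segment until a merge drops it, which is why heavily updated indices can temporarily use more disk space than their live data requires.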

Why This Architecture Works So Well

This segment-based design gives Elasticsearch several important advantages:

  • Extremely fast writes, because existing data is never modified
  • Predictable performance, because immutable segments can be read and cached without locking
  • Efficient parallel search, because segments can be searched independently
  • Automatic cleanup, via background merge operations

All of this complexity is hidden from the user, but it explains many Elasticsearch behaviors you may observe in production.