
How Elasticsearch Works Behind the Scenes

In the next few tutorials, we will gradually build an understanding of how Elasticsearch works internally. The goal here is not to overload you with theory all at once, but to give you a clear, high-level mental model of how search engines work in general. Once this foundation is strong, many advanced Elasticsearch concepts will start making sense naturally.

At the heart of Elasticsearch lies another powerful technology: Apache Lucene.

Elasticsearch and Apache Lucene: Who Does What?

Whenever you send documents to Elasticsearch—whether for storing, searching, or retrieving—Elasticsearch itself does not do all the low-level work. Instead:

  • Apache Lucene is responsible for:
    • Storing documents on disk
    • Indexing documents
    • Searching and retrieving matching documents efficiently
  • Elasticsearch acts as:
    • A distributed REST server built on top of Lucene
    • A system that makes Lucene scalable, highly available, and easy to use
    • A layer that exposes clean HTTP APIs so applications can interact with Lucene without embedding it directly

This distinction is important. Lucene is not a server. It is a Java library designed to be used as a dependency inside applications. Elasticsearch fills this gap by turning Lucene into a full-fledged distributed search engine.

So while Lucene is the core search engine, Elasticsearch is what makes it production-ready.
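To see what "Lucene is a library, not a server" means in practice, here is a minimal sketch of embedding Lucene directly in a Java application. It uses Lucene 9.x style APIs and an in-memory directory purely for illustration; class locations and signatures can differ between Lucene versions.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneAsALibrary {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();   // in-memory index, just for the demo

        // Indexing: the application embeds Lucene and calls it like any other library.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("name", "Apple iPhone 14", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Searching: look up the term "apple" in the "name" field.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("name", "apple")), 10);
            System.out.println("Matching documents: " + hits.totalHits.value);
        }
    }
}

Notice how much is missing here compared to Elasticsearch: there is no HTTP API, no cluster, no replication. That is exactly the gap Elasticsearch fills.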

A Real-World Example: Product Search in an E-commerce Application

Let us imagine an e-commerce application similar to Amazon. We create an index called products and insert a few documents:

  • Document 1: Apple iPhone 14
  • Document 2: Samsung Galaxy S22
  • Document 3: Apple MacBook Pro

These documents are sent to Elasticsearch.
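For concreteness, here is one way such documents could be sent from Java using the standard java.net.http.HttpClient against Elasticsearch's document API. This is only a sketch: it assumes a local, security-disabled cluster listening on localhost:9200, and in a real application you would typically use an official client library and authentication.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IndexProducts {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String[] docs = {
            "{\"name\": \"Apple iPhone 14\"}",
            "{\"name\": \"Samsung Galaxy S22\"}",
            "{\"name\": \"Apple MacBook Pro\"}"
        };
        for (int id = 1; id <= docs.length; id++) {
            // PUT /products/_doc/{id} indexes one document into the products index.
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/products/_doc/" + id))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(docs[id - 1]))
                .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }
}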

What Happens Internally?

  1. Elasticsearch receives the documents.
  2. Elasticsearch passes them to Lucene.
  3. Lucene parses each document and begins the process known as analysis.

Analysis and Tokenization (High-Level View)

During analysis, Lucene examines text fields and breaks them into tokens. For example:

  • "Apple iPhone 14"Apple, iPhone, 14
  • "Samsung Galaxy S22"Samsung, Galaxy, S22
  • "Apple MacBook Pro"Apple, MacBook, Pro

Important clarification:
Lucene does not modify or lose your original document. The original JSON document is stored safely as-is. Tokenization is only used to build internal data structures that make searching fast.

Tokenization is just one part of the broader analysis process, which we will study in detail later.
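As a rough illustration, the sketch below runs Lucene's default StandardAnalyzer over one of the example values and prints the tokens it emits. The default analyzer also lowercases tokens, which is one of the extra analysis steps we will cover later. APIs shown are Lucene 9.x style and may differ slightly across versions.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizationDemo {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream stream = analyzer.tokenStream("name", "Apple iPhone 14")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());   // prints: apple, iphone, 14
            }
            stream.end();
        }
    }
}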

The Inverted Index: The Core Idea

After tokenization, Lucene builds a data structure called an inverted index.

Conceptually, it looks like this:

Term      Document IDs   Positions
Apple     1, 3           0, 0
iPhone    1              1
Samsung   2              0
Galaxy    2              1
MacBook   3              1
Pro       3              2

What Is Being Stored?

  • Term: A token extracted from text
  • Document ID: Which documents contain that term
  • Position: Where the term appears in the document (used for phrase queries, relevance scoring, etc.)

This structure allows Lucene to answer search queries extremely fast.
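Lucene's on-disk format is far more sophisticated, but the core mapping can be sketched with ordinary Java collections. The toy class below is purely illustrative (not Lucene code): it maps each term to the document IDs and positions where it occurs.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyInvertedIndex {
    // term -> (document ID -> positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

    // Tokenize naively on whitespace and record each token's document ID and position.
    public void add(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            index.computeIfAbsent(tokens[pos], t -> new HashMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(pos);
        }
    }

    // Look up a single term: one map access instead of scanning every document.
    public Map<Integer, List<Integer>> lookup(String term) {
        return index.getOrDefault(term.toLowerCase(), Map.of());
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.add(1, "Apple iPhone 14");
        idx.add(2, "Samsung Galaxy S22");
        idx.add(3, "Apple MacBook Pro");
        System.out.println(idx.lookup("Apple"));   // {1=[0], 3=[0]} -> documents 1 and 3, position 0
    }
}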

Why Is This Powerful?

Imagine a user types “Apple” into the search box.

  • Lucene looks up the term Apple in the inverted index.
  • It immediately finds document IDs 1 and 3.
  • The result is returned without scanning all documents.

This is fundamentally different from how relational databases work.

Comparison with Relational Databases

Let us imagine storing the same data in a relational database like Postgres or MySQL:

product_id   name
1            Apple iPhone 14
2            Samsung Galaxy S22
3            Apple MacBook Pro

To search for “Apple”, you would write:

SELECT * FROM products WHERE name LIKE '%Apple%';

What Happens Internally?

  • The database must scan every row.
  • It checks whether each value contains the word “Apple”.
  • This approach scales poorly as the table grows.

Elasticsearch vs Relational Databases: Intuition

A useful mental model:

  • Elasticsearch search behavior is similar to a Java HashMap lookup, where access is close to constant time.
  • Relational database text search is similar to calling contains() on a list, which requires scanning each element.

This is why Elasticsearch excels at full-text search, while relational databases struggle at scale.
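The analogy can be written out directly in a few lines of Java. The data in this snippet is hypothetical and simply mirrors the three example products:

import java.util.List;
import java.util.Map;

public class LookupVsScan {
    public static void main(String[] args) {
        // Inverted-index style: term -> document IDs, close to constant time per lookup.
        Map<String, List<Integer>> invertedIndex = Map.of(
            "apple", List.of(1, 3),
            "iphone", List.of(1));
        System.out.println(invertedIndex.get("apple"));           // [1, 3]

        // LIKE '%Apple%' style: scan every row and check each value, O(n).
        List<String> rows = List.of("Apple iPhone 14", "Samsung Galaxy S22", "Apple MacBook Pro");
        rows.stream()
            .filter(name -> name.contains("Apple"))
            .forEach(System.out::println);
    }
}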

Searching for Multiple Terms

Now suppose the user searches for “Apple iPhone”.

What happens?

  1. Elasticsearch tokenizes the input query:
    • Apple, iPhone
  2. It looks up both terms in the inverted index:
    • Apple → documents 1 and 3
    • iPhone → document 1
  3. Elasticsearch finds the intersection:
    • Document 1 contains both terms

This is why Elasticsearch can efficiently handle multi-word queries.
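Continuing the toy model from earlier, the intersection step can be sketched as follows. The posting lists are illustrative; real query execution in Elasticsearch also involves relevance scoring.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class QueryTermIntersection {
    public static void main(String[] args) {
        // Posting lists after tokenizing the query "Apple iPhone".
        Map<String, Set<Integer>> postings = Map.of(
            "apple", Set.of(1, 3),
            "iphone", Set.of(1));

        // Intersect the posting lists of all query terms.
        Set<Integer> result = new HashSet<>(postings.get("apple"));
        result.retainAll(postings.get("iphone"));

        System.out.println(result);   // [1] -> only document 1 contains both terms
    }
}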

Multiple Fields, Multiple Inverted Indexes

In earlier examples, we assumed a single field. In real applications, documents often contain many fields:

{
  "name": "Apple iPhone 14",
  "description": "Latest Apple smartphone with A15 chip",
  "category": "mobile"
}

In such cases:

  • Each searchable field has its own inverted index
  • Queries can target one or more fields
  • Elasticsearch combines results intelligently

This design is what allows flexible and powerful search queries.
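One simple way to picture "one inverted index per field" is a map keyed by field name, where each value is its own term-to-documents map. Again, this is a toy illustration, not how Lucene actually organizes fields internally:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PerFieldIndexes {
    // field name -> (term -> document IDs): one inverted index per searchable field
    private final Map<String, Map<String, Set<Integer>>> fieldIndexes = new HashMap<>();

    public void add(int docId, String field, String text) {
        Map<String, Set<Integer>> index = fieldIndexes.computeIfAbsent(field, f -> new HashMap<>());
        for (String token : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(token, t -> new HashSet<>()).add(docId);
        }
    }

    public Set<Integer> search(String field, String term) {
        return fieldIndexes.getOrDefault(field, Map.of())
                           .getOrDefault(term.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        PerFieldIndexes indexes = new PerFieldIndexes();
        indexes.add(1, "name", "Apple iPhone 14");
        indexes.add(1, "description", "Latest Apple smartphone with A15 chip");
        indexes.add(1, "category", "mobile");

        System.out.println(indexes.search("name", "apple"));        // [1]
        System.out.println(indexes.search("description", "chip"));  // [1]
        System.out.println(indexes.search("category", "mobile"));   // [1]
    }
}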

Why Is It Called an “Inverted” Index?

In relational databases:

  • The primary key (ID) is indexed
  • You retrieve rows by ID quickly
  • Text values are stored as plain data

In Lucene and Elasticsearch:

  • The model is inverted
  • The term becomes the key
  • The value is the list of document IDs containing that term

Because the traditional structure is flipped, this data structure is called an inverted index.

Is Elasticsearch Slow for ID-Based Lookups?

A common question is:

“If Elasticsearch is optimized for term searches, will retrieving a document by ID be slow?”

The answer is no.

Behind the scenes, Lucene uses different data structures optimized for different access patterns. Retrieving a document by ID is still very fast. We will explore these internal structures later.
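For example, fetching a document by its ID is a single call to the document GET endpoint. The sketch below assumes the same local, security-disabled cluster as in the earlier indexing example:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GetById {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Direct lookup by document ID -- no full-text search involved.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/products/_doc/1"))
            .GET()
            .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // the stored JSON document, wrapped in metadata
    }
}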