In the next few tutorials, we will gradually build an understanding of how Elasticsearch works internally. The goal here is not to overload you with theory all at once, but to give you a clear, high-level mental model of how search engines work in general. Once this foundation is strong, many advanced Elasticsearch concepts will start making sense naturally.
At the heart of Elasticsearch lies another powerful technology: Apache Lucene.
Elasticsearch and Apache Lucene: Who Does What?
Whenever you send documents to Elasticsearch—whether for storing, searching, or retrieving—Elasticsearch itself does not do all the low-level work. Instead:
- Apache Lucene is responsible for:
  - Storing documents on disk
  - Indexing documents
  - Searching and retrieving matching documents efficiently
- Elasticsearch acts as:
  - A distributed REST server built on top of Lucene
  - A system that makes Lucene scalable, highly available, and easy to use
  - A layer that exposes clean HTTP APIs so applications can interact with Lucene without embedding it directly
This distinction is important. Lucene is not a server. It is a Java library designed to be used as a dependency inside applications. Elasticsearch fills this gap by turning Lucene into a full-fledged distributed search engine.
So while Lucene is the core search engine, Elasticsearch is what makes it production-ready.
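To make this concrete, here is a minimal sketch, assuming a local Elasticsearch node on `localhost:9200` with security disabled and Python's `requests` library installed. It shows that an application only ever talks HTTP to Elasticsearch and never touches Lucene directly:

```python
import requests

# Ask the cluster for basic information over plain HTTP.
# The client needs no Lucene classes and no Java at all.
response = requests.get("http://localhost:9200")
info = response.json()
print(info["cluster_name"], info["version"]["number"])
```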
A Real-World Example: Product Search in an E-commerce Application
Let us imagine an e-commerce application similar to Amazon. We create an index called products and insert a few documents:
- Document 1: Apple iPhone 14
- Document 2: Samsung Galaxy S22
- Document 3: Apple MacBook Pro
These documents are sent to Elasticsearch.
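As a simple illustration (again assuming a local node on `localhost:9200` with security disabled), the three documents could be indexed like this; `PUT /products/_doc/{id}` stores each document under an explicit ID and creates the `products` index on first use:

```python
import requests

ES = "http://localhost:9200"  # assumed local, security-disabled cluster

products = {
    1: {"name": "Apple iPhone 14"},
    2: {"name": "Samsung Galaxy S22"},
    3: {"name": "Apple MacBook Pro"},
}

# Index each product document under an explicit ID.
for doc_id, doc in products.items():
    requests.put(f"{ES}/products/_doc/{doc_id}", json=doc)
```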
What Happens Internally?
- Elasticsearch receives the documents.
- Elasticsearch passes them to Lucene.
- Lucene parses each document and begins the process known as analysis.
Analysis and Tokenization (High-Level View)
During analysis, Lucene examines text fields and breaks them into tokens. For example:
"Apple iPhone 14"→Apple,iPhone,14"Samsung Galaxy S22"→Samsung,Galaxy,S22"Apple MacBook Pro"→Apple,MacBook,Pro
Important clarification:
Lucene does not modify or lose your original document. The original JSON document is stored safely as-is. Tokenization is only used to build internal data structures that make searching fast.
Tokenization is just one part of the broader analysis process, which we will study in detail later.
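To keep the idea concrete, here is a deliberately simplified tokenizer sketch. It only splits on whitespace, whereas a real analyzer also lowercases, strips punctuation, and more:

```python
def tokenize(text: str) -> list[str]:
    # Simplified stand-in for analysis: split on whitespace only.
    return text.split()

print(tokenize("Apple iPhone 14"))  # ['Apple', 'iPhone', '14']
```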
The Inverted Index: The Core Idea
After tokenization, Lucene builds a data structure called an inverted index.
Conceptually, it looks like this:
| Term | Document IDs | Positions |
|---|---|---|
| Apple | 1, 3 | 0, 0 |
| iPhone | 1 | 1 |
| 14 | 1 | 2 |
| Samsung | 2 | 0 |
| Galaxy | 2 | 1 |
| S22 | 2 | 2 |
| MacBook | 3 | 1 |
| Pro | 3 | 2 |
What Is Being Stored?
- Term: A token extracted from text
- Document ID: Which documents contain that term
- Position: Where the term appears in the document (used for phrase queries, relevance scoring, etc.)
This structure allows Lucene to answer search queries extremely fast.
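The structure itself is easy to sketch. The toy example below (not Lucene's actual on-disk format, which is far more compact) builds a dictionary mapping each term to its (document ID, position) pairs, mirroring the table above:

```python
from collections import defaultdict

docs = {
    1: "Apple iPhone 14",
    2: "Samsung Galaxy S22",
    3: "Apple MacBook Pro",
}

# term -> list of (document ID, position) pairs
inverted_index = defaultdict(list)
for doc_id, text in docs.items():
    for position, term in enumerate(text.split()):
        inverted_index[term].append((doc_id, position))

print(inverted_index["Apple"])  # [(1, 0), (3, 0)]
```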
Why Is This Powerful?
Imagine a user types “Apple” into the search box.
- Lucene looks up the term `Apple` in the inverted index.
- It immediately finds document IDs 1 and 3.
- The result is returned without scanning all documents.
This is fundamentally different from how relational databases work.
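Using the toy index from the previous sketch, such a lookup is a single dictionary access rather than a scan over all documents:

```python
# Reusing inverted_index from the sketch above.
postings = inverted_index.get("Apple", [])        # one hash lookup, no scan
matching_docs = {doc_id for doc_id, _ in postings}
print(matching_docs)                              # {1, 3}
```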
Comparison with Relational Databases
Let us imagine storing the same data in a relational database like Postgres or MySQL:
| product_id | name |
|---|---|
| 1 | Apple iPhone 14 |
| 2 | Samsung Galaxy S22 |
| 3 | Apple MacBook Pro |
To search for “Apple”, you would write:
```sql
SELECT * FROM products WHERE name LIKE '%Apple%';
```
What Happens Internally?
- The database must scan every row.
- It checks whether each value contains the word “Apple”.
- This approach scales poorly as the table grows.
Elasticsearch vs Relational Databases: Intuition
A useful mental model:
- Elasticsearch search behavior is similar to a Java `HashMap` lookup, where access is close to constant time.
- Relational database text search is similar to calling `contains()` on a list, which requires scanning each element.
This is why Elasticsearch excels at full-text search, while relational databases struggle at scale.
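The analogy above is phrased in Java terms; the same intuition in a small Python sketch compares a hash-map lookup against a row-by-row scan:

```python
# Row-by-row scan, like LIKE '%Apple%': cost grows with the number of rows.
rows = ["Apple iPhone 14", "Samsung Galaxy S22", "Apple MacBook Pro"]
scan_hits = [row for row in rows if "Apple" in row]

# Hash-map lookup, like an inverted index: close to constant time per term.
# (Abbreviated index: only a few terms shown.)
index = {"Apple": [1, 3], "Samsung": [2], "MacBook": [3]}
lookup_hits = index.get("Apple", [])
```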
Searching for Multiple Terms
Now suppose the user searches for “Apple iPhone”.
What happens?
- Elasticsearch tokenizes the input query: `Apple`, `iPhone`
- It looks up both terms in the inverted index:
  - `Apple` → documents 1 and 3
  - `iPhone` → document 1
- Elasticsearch finds the intersection: document 1 contains both terms
This is why Elasticsearch can efficiently handle multi-word queries.
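Here is a minimal sketch of that intersection step, using the posting lists from the toy index (real relevance scoring is richer than a plain set intersection):

```python
# Document IDs per term, taken from the toy inverted index.
postings = {
    "Apple": {1, 3},
    "iPhone": {1},
}

query_terms = ["Apple", "iPhone"]

# Documents that contain every query term: intersect the posting lists.
results = set.intersection(*(postings[term] for term in query_terms))
print(results)  # {1}
```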
Multiple Fields, Multiple Inverted Indexes
In earlier examples, we assumed a single field. In real applications, documents often contain many fields:
```json
{
  "name": "Apple iPhone 14",
  "description": "Latest Apple smartphone with A15 chip",
  "category": "mobile"
}
```
In such cases:
- Each searchable field has its own inverted index
- Queries can target one or more fields
- Elasticsearch combines results intelligently
This design is what allows flexible and powerful search queries.
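As a small example (assuming the richer document shape above has been indexed into `products`), a `multi_match` query searches several fields at once, each backed by its own inverted index:

```python
import requests

query = {
    "query": {
        "multi_match": {
            "query": "Apple",
            "fields": ["name", "description"],
        }
    }
}

# POST /products/_search runs the query against the name and description fields.
response = requests.post("http://localhost:9200/products/_search", json=query)
for hit in response.json()["hits"]["hits"]:
    print(hit["_id"], hit["_source"]["name"])
```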
Why Is It Called an “Inverted” Index?
In relational databases:
- The primary key (ID) is indexed
- You retrieve rows by ID quickly
- Text values are stored as plain data
In Lucene and Elasticsearch:
- The model is inverted
- The term becomes the key
- The value is the list of document IDs containing that term
Because the traditional structure is flipped, this data structure is called an inverted index.
Is Elasticsearch Slow for ID-Based Lookups?
A common question is:
“If Elasticsearch is optimized for term searches, will retrieving a document by ID be slow?”
The answer is no.
Behind the scenes, Lucene uses different data structures optimized for different access patterns. Retrieving a document by ID is still very fast. We will explore these internal structures later.
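For completeness, a minimal ID-based lookup (same local-cluster assumptions as before) uses `GET /{index}/_doc/{id}` and returns the stored document directly:

```python
import requests

# Fetch the original JSON document by its ID.
response = requests.get("http://localhost:9200/products/_doc/1")
print(response.json()["_source"])  # {'name': 'Apple iPhone 14'}
```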
