This tutorial offers a quick but important clarification: many people use the terms inverted index and term dictionary interchangeably. That is common in conversation, but the two are not the same thing.
To build a correct mental model of how Elasticsearch works internally (through Apache Lucene), we must clearly understand how these pieces fit together.
The Big Picture
- The inverted index is the entire data structure Lucene uses for full-text search.
- The term dictionary is one component of the inverted index.
- The posting list is another component of the inverted index.
So conceptually, the structure looks like this:
Inverted Index
├── Term Dictionary
│   ├── Sorted terms
│   ├── Metadata (document frequency, etc.)
│   └── Pointer to posting list
└── Posting Lists
    ├── Document IDs
    ├── Term frequency
    └── Term positions
Because these components are tightly coupled, people often loosely refer to all of them as the “inverted index,” which is why the confusion exists.
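To make the relationship between the pieces concrete, here is a minimal in-memory sketch of the structure described above. The class names (`Posting`, `TermEntry`, `InvertedIndex`) are hypothetical and chosen only for illustration. Lucene's real on-disk format is far more compact and elaborate, so treat this as a mental model rather than a description of the implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Posting:
    """One entry of a posting list: a document and where the term occurs in it."""
    doc_id: int
    positions: list[int]

    @property
    def term_frequency(self) -> int:
        # In this sketch, term frequency is the number of recorded positions.
        return len(self.positions)


@dataclass
class TermEntry:
    """One entry of the term dictionary (hypothetical layout)."""
    term: str
    # A direct reference stands in for the "pointer to posting list".
    postings: list[Posting] = field(default_factory=list)

    @property
    def document_frequency(self) -> int:
        # One posting per document, so the list length is the document frequency.
        return len(self.postings)


@dataclass
class InvertedIndex:
    # The term dictionary is assumed to be kept sorted by term.
    term_dictionary: list[TermEntry] = field(default_factory=list)


tea = TermEntry("tea", [Posting(doc_id=2, positions=[2])])
index = InvertedIndex(term_dictionary=[tea])
print(tea.document_frequency, tea.postings[0].term_frequency)
```

Note that the term entry only holds a reference to its postings, while the per-document details live in the `Posting` objects. This mirrors the split between the two components.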
What Is the Term Dictionary?
The term dictionary is essentially a sorted list of all unique terms that appear in the indexed documents.
For each term, it stores:
- The term itself, kept in sorted order for fast lookup.
- Document frequency, which tells how many documents contain this term.
- A pointer to the posting list, where detailed information is stored.
The term dictionary does not store document-level details directly. Instead, it acts as an index of terms, guiding Elasticsearch to the right posting list.
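Keeping the terms sorted is what makes lookup fast: a term can be found with a binary search instead of a full scan. The sketch below illustrates that idea with Python's `bisect` module. The parallel lists and the `posting_offsets` values are made up for illustration, and Lucene's actual term dictionary uses more sophisticated structures than a plain sorted list.

```python
import bisect

# Hypothetical in-memory term dictionary: parallel lists, sorted by term.
terms = ["coffee", "tea"]
doc_freqs = [2, 1]
posting_offsets = [0, 2]  # made-up "pointers" into some postings storage


def lookup(term):
    """Return (document_frequency, posting_offset), or None if the term is absent."""
    i = bisect.bisect_left(terms, term)  # binary search over the sorted terms
    if i < len(terms) and terms[i] == term:
        return doc_freqs[i], posting_offsets[i]
    return None


print(lookup("coffee"))
print(lookup("milk"))
```

A lookup for a term that was never indexed returns `None` without touching any posting list, which is why a miss is cheap.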
What Is a Posting List?
The posting list contains the actual details needed during search and scoring:
- Which document IDs contain the term.
- Term frequency in each document (how many times the term appears).
- Positions of the term within each document (used for phrase and proximity queries).
This separation allows Elasticsearch to quickly jump from a term to the exact documents and positions where it appears.
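Stored positions are what make phrase matching possible. As a rough sketch, two terms form a phrase in a document when the second term occurs at the position immediately after the first. The function name `phrase_match` and the sample document are hypothetical, and this is a simplification of how a real phrase query is evaluated.

```python
def phrase_match(first_positions, second_positions):
    """Return True if the second term occurs directly after the first one."""
    following = set(second_positions)
    return any(p + 1 in following for p in first_positions)


# Zero-based positions in a hypothetical document:
# "strong coffee beats weak coffee"
strong = [0]
coffee = [1, 4]
weak = [3]

print(phrase_match(strong, coffee))  # is "strong coffee" a phrase here?
print(phrase_match(coffee, strong))  # is "coffee strong" a phrase here?
```

Without positions, the index could only say that both terms occur in the document, not whether they are adjacent.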
Understanding Term Frequency and Document Frequency
To clearly understand term frequency and document frequency, let us use a better example than product names.
Example Documents
Document ID 1
Sam likes coffee. He always starts his day with coffee. A strong coffee can keep you awake.
Document ID 2
Some prefer tea over coffee.
We will ignore all other words and focus only on coffee and tea.
Term Dictionary for This Example
The terms are sorted alphabetically:
| Term | Document Frequency |
|---|---|
| coffee | 2 |
| tea | 1 |
Why?
- coffee appears in both Document 1 and Document 2, so its document frequency is 2.
- tea appears only in Document 2, so its document frequency is 1.
Each term now points to its corresponding posting list.
Posting List Details
Posting List for “coffee”
| Document ID | Positions | Term Frequency |
|---|---|---|
| 1 | 2, 9, 12 | 3 out of 17 |
| 2 | 4 | 1 out of 5 |
Explanation:
- In Document 1, the word coffee appears:
  - At positions 2, 9, and 12 (Lucene uses zero-based indexing).
  - A total of 3 times in a document containing 17 terms.
- In Document 2, coffee appears:
  - Once, at position 4.
  - 1 time in a document containing 5 terms.
This is what term frequency means:
How many times a term appears within a specific document.
Posting List for “tea”
| Document ID | Positions | Term Frequency |
|---|---|---|
| 2 | 2 | 1 out of 5 |
- tea appears only once, and only in Document 2.
- Therefore, its document frequency is 1.
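The tables above can be reproduced with a few lines of Python. This is a minimal sketch: the `tokenize` function is a simplified stand-in for a real analyzer (it only lowercases and splits on non-letters), and the dictionary-of-dictionaries layout is an assumption made for readability, not how Lucene stores postings.

```python
import re
from collections import defaultdict

documents = {
    1: "Sam likes coffee. He always starts his day with coffee. "
       "A strong coffee can keep you awake.",
    2: "Some prefer tea over coffee.",
}


def tokenize(text):
    # Simplified stand-in for an analyzer: lowercase, then split on non-letters.
    return re.findall(r"[a-z]+", text.lower())


# term -> {doc_id -> [zero-based positions]}
postings = defaultdict(dict)
doc_lengths = {}
for doc_id, text in documents.items():
    tokens = tokenize(text)
    doc_lengths[doc_id] = len(tokens)
    for position, token in enumerate(tokens):
        postings[token].setdefault(doc_id, []).append(position)

for term in ("coffee", "tea"):
    entry = postings[term]
    print(term, "DF =", len(entry))
    for doc_id, positions in entry.items():
        print(f"  doc {doc_id}: positions={positions}, "
              f"TF={len(positions)} of {doc_lengths[doc_id]}")
```

Document frequency falls out as the number of documents in a term's entry, and term frequency as the number of positions recorded for one document.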
Key Definitions
- Document Frequency (DF)
  The number of documents that contain a given term, regardless of how many times it appears in each document.
- Term Frequency (TF)
  The number of times a term appears within a single document. Lucene stores this raw count; the document's length is taken into account separately at scoring time, which is why the tables above show it as "3 out of 17".
Why Does Elasticsearch Store All This Information?
All of this detailed information—document frequency, term frequency, and positions—is stored for one major reason:
Relevance scoring.
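Later, when a user performs a search, Elasticsearch uses these values to calculate how relevant each document is to the query. Scoring models such as TF-IDF and BM25 (the default in current Elasticsearch versions) rely directly on:
- How common a term is across documents (document frequency).
- How often the term occurs within a document (term frequency), weighed against the document's length.
Positions are not an input to the TF-IDF or BM25 formulas themselves. They are what make phrase and proximity queries possible, which in turn affects which documents match a query.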
Without this structure, meaningful ranking would not be possible.
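To show how document frequency and term frequency feed into a score, here is a sketch of the textbook form of BM25 applied to the coffee and tea example. The function name `bm25_term_score` is hypothetical, and `k1 = 1.2` and `b = 0.75` are the commonly cited defaults. Lucene's own implementation differs in some details, so the exact numbers should not be expected to match what Elasticsearch reports.

```python
import math


def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document's score."""
    # Rare terms (low df) get a higher inverse document frequency.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Longer-than-average documents are penalized, shorter ones boosted.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


num_docs = 2
avg_doc_len = (17 + 5) / 2  # document lengths from the example above

coffee_doc1 = bm25_term_score(tf=3, df=2, doc_len=17,
                              avg_doc_len=avg_doc_len, num_docs=num_docs)
coffee_doc2 = bm25_term_score(tf=1, df=2, doc_len=5,
                              avg_doc_len=avg_doc_len, num_docs=num_docs)
tea_doc2 = bm25_term_score(tf=1, df=1, doc_len=5,
                           avg_doc_len=avg_doc_len, num_docs=num_docs)
print(coffee_doc1, coffee_doc2, tea_doc2)
```

For a search on coffee, Document 1 scores higher than Document 2 because the term occurs three times there, even after its greater length is accounted for. A match on tea in Document 2 scores higher than a match on coffee in the same document, because tea is the rarer term.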
