Learnitweb

Understanding How Documents Are Stored in Elasticsearch

In this short section, we look at how Elasticsearch stores documents internally and how that structure differs from what we usually see in a relational database.

Once you understand how Elasticsearch represents a document, everything else (searching, updating, deleting, and versioning) makes much more sense.

How Data Looks in a Relational Database

In a traditional relational database, data is stored in tables.

Each table:

  • Has rows and columns
  • Each row represents a record
  • Each column represents a field

For example, in a books table:

id   title    author     price
1    Book A   Author X   399
2    Book B   Author Y   499

Here:

  • id is usually an auto-generated primary key
  • Every time we insert a new record, a new row is created
  • The structure is fixed and schema-driven

This is the traditional relational model most developers are familiar with.
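
The relational model described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the in-memory database and the exact column types are assumptions, while the table name and the rows come from the example above.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Fixed schema: every row has exactly these columns.
conn.execute(
    "CREATE TABLE books ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "title TEXT, author TEXT, price INTEGER)"
)

# The id column is omitted on insert; the database generates it.
conn.executemany(
    "INSERT INTO books (title, author, price) VALUES (?, ?, ?)",
    [("Book A", "Author X", 399), ("Book B", "Author Y", 499)],
)

rows = conn.execute(
    "SELECT id, title, author, price FROM books ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

Each insert produces one new row, and the id values are assigned by the database rather than by the caller.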

How Elasticsearch Stores Data

Elasticsearch works differently. Instead of rows and tables, it uses:

  • Indexes (similar to tables)
  • Documents (similar to rows)
  • Fields (similar to columns)

However, when you retrieve a document from Elasticsearch, you will notice that the structure looks quite different.

A Typical Elasticsearch Document Response

When you fetch a document from Elasticsearch, you will see something like this:

{
  "_index": "books",
  "_id": "abc123",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "_source": {
    "title": "The Alchemist",
    "author": "Paulo Coelho",
    "price": 399
  }
}

This response contains two major parts:

  1. Metadata fields (fields starting with _)
  2. Actual document data (_source)
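
To make the split concrete, here is a minimal sketch that parses the response shown above and separates the two parts. The helper name split_response is hypothetical and not part of any Elasticsearch client; the JSON is copied from the example response.

```python
import json

# The example response from above, as it would arrive over HTTP.
response_text = """
{
  "_index": "books",
  "_id": "abc123",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "_source": {
    "title": "The Alchemist",
    "author": "Paulo Coelho",
    "price": 399
  }
}
"""


def split_response(response):
    """Return (metadata, source) for a single-document response.

    Hypothetical helper: metadata is every top-level key that starts
    with an underscore except _source; source is the stored document.
    """
    metadata = {
        key: value
        for key, value in response.items()
        if key.startswith("_") and key != "_source"
    }
    source = response.get("_source", {})
    return metadata, source


response = json.loads(response_text)
metadata, source = split_response(response)
print(metadata)
print(source)
```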

Let’s understand both clearly.

1. Metadata Fields (Fields Starting with _)

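Fields whose names start with an underscore (_) are called metadata fields.

Elasticsearch manages these fields automatically, and every document has them. Strictly speaking, _source is also a metadata field, but because it holds your own data it is covered separately in part 2.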

_index

This tells you which index the document belongs to.

Example:

"_index": "books"

This means the document is stored inside the books index.

_id

This is the unique identifier for the document.

  • It can be auto-generated by Elasticsearch
  • Or you can provide your own ID while inserting the document

Every document must have an _id that is unique within its index.
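
The two ways of assigning an _id correspond to two request shapes in the document index API: POST /<index>/_doc lets Elasticsearch generate the ID, and PUT /<index>/_doc/<_id> uses the one you supply. The sketch below only builds the method and path; the function name index_request is hypothetical, and nothing is sent to a cluster.

```python
from typing import Optional, Tuple
from urllib.parse import quote


def index_request(index: str, doc_id: Optional[str] = None) -> Tuple[str, str]:
    """Build the HTTP method and path for indexing one document.

    Hypothetical helper for illustration only.
    """
    if doc_id is None:
        # No ID supplied: Elasticsearch generates one.
        return "POST", f"/{index}/_doc"
    # Caller-supplied ID, escaped so it is safe inside a URL path.
    return "PUT", f"/{index}/_doc/{quote(doc_id, safe='')}"


auto_id_request = index_request("books")
own_id_request = index_request("books", "abc123")
print(auto_id_request)
print(own_id_request)
```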

_version

This represents the version number of the document.

It starts at 1 when the document is created, and it increases each time the document is changed.
It helps Elasticsearch tell a newer copy of a document from an older one.
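
The versioning behavior can be pictured with a toy in-memory model. This is a deliberately simplified sketch and not how Elasticsearch is implemented; the ToyIndex class and the changed price are invented for illustration.

```python
class ToyIndex:
    """Toy stand-in for an index that only tracks _version per _id."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, source):
        """Store a document, bumping _version if the ID already exists."""
        previous = self._docs.get(doc_id)
        version = 1 if previous is None else previous["_version"] + 1
        stored = {"_id": doc_id, "_version": version, "_source": dict(source)}
        self._docs[doc_id] = stored
        return stored


books = ToyIndex()
first = books.index("abc123", {"title": "The Alchemist", "price": 399})
# Writing to the same _id again replaces the document and bumps the version.
second = books.index("abc123", {"title": "The Alchemist", "price": 349})
print(first["_version"], second["_version"])
```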

_seq_no and _primary_term

These two fields are related to optimistic concurrency control.

Their main purpose is:

  • To avoid conflicts when multiple requests try to update the same document at the same time
  • To ensure data consistency in distributed systems

For now, you don't need to understand these fields in depth.
We will discuss them later when we talk about updates and concurrency control.
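
As a brief preview, the index API accepts the if_seq_no and if_primary_term query parameters, so a write can be made conditional on the values you last read. If the stored values differ, Elasticsearch rejects the write with a version conflict. The sketch below only builds such a request path from a fetched document; the function name conditional_index_path is hypothetical, and no request is sent.

```python
from urllib.parse import urlencode


def conditional_index_path(doc):
    """Build a write path that succeeds only if the document is unchanged.

    Hypothetical helper: doc is a response dict like the example above.
    """
    params = urlencode(
        {
            "if_seq_no": doc["_seq_no"],
            "if_primary_term": doc["_primary_term"],
        }
    )
    return f"/{doc['_index']}/_doc/{doc['_id']}?{params}"


# Metadata copied from the example response earlier in this section.
fetched = {"_index": "books", "_id": "abc123", "_seq_no": 0, "_primary_term": 1}
path = conditional_index_path(fetched)
print(path)
```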

2. The _source Field (Most Important Part)

The _source field holds the actual document data: the original JSON body that you sent when you indexed the document.

Example:

"_source": {
  "title": "The Alchemist",
  "author": "Paulo Coelho",
  "price": 399
}

This is the real content of your document.

Everything else outside _source exists mainly for:

  • Metadata management
  • Versioning
  • Replication
  • Conflict resolution

When people say “document data”, they usually mean what is inside _source.
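
In application code this usually means pulling _source out of each result and ignoring the rest. Search responses nest their results under hits.hits, with one entry per matching document. The sketch below assumes a trimmed, hand-written response in that shape; the helper name sources_from_hits and the sample IDs are hypothetical.

```python
def sources_from_hits(search_response):
    """Return only the stored document bodies from a search response."""
    return [hit["_source"] for hit in search_response["hits"]["hits"]]


# Hand-written sample in the shape of a search response (trimmed).
sample_response = {
    "hits": {
        "hits": [
            {
                "_index": "books",
                "_id": "1",
                "_source": {"title": "Book A", "author": "Author X", "price": 399},
            },
            {
                "_index": "books",
                "_id": "2",
                "_source": {"title": "Book B", "author": "Author Y", "price": 499},
            },
        ]
    }
}

documents = sources_from_hits(sample_response)
for document in documents:
    print(document["title"], document["price"])
```

The resulting list looks much like the rows of the relational books table, which is why _source is what most application code ends up working with.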