Learnitweb

Hive Architecture

In this tutorial, we will take a detailed look at the architecture of Hive: how its components are structured, and how a query flows through the system from start to finish in a real execution scenario.

Introduction to Hive Architecture

Hive architecture consists of five major components. One of these, the Execution Engine, is tightly integrated with the Hadoop framework, which is responsible for the actual data storage and processing.

These five components are:

  • User Interface (UI)
  • Driver
  • Compiler
  • Metastore
  • Execution Engine

We will now go through each of these components one by one in a detailed and structured manner so that you clearly understand their responsibilities and interactions.

1. User Interface (UI)

The User Interface is the component through which users directly interact with Hive and submit their queries for execution.

Whenever you write a Hive query, it is always entered through some form of user interface, making this the starting point of the entire architecture.

Types of User Interfaces in Hive

There are three main ways through which a user can interact with Hive and execute queries:

  • Hive Command Line Interface (CLI)
    This is the most basic and traditional way of interacting with Hive, where users type commands directly into the Hive shell and receive results in the same terminal.
  • Hive Web Interface
    This provides a browser-based interface that allows users to execute queries without using a terminal. (It shipped with older Hive releases and has been removed from recent versions.)
  • Thrift Server (JDBC/ODBC Connections)
    This allows external applications to connect to Hive using JDBC or ODBC drivers, enabling integration with tools like Java applications, BI tools, or reporting systems.
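As a concrete illustration, external tools connect to HiveServer2 (the Thrift-based server) using a JDBC URL of the following form, where the host name and database are placeholders and 10000 is the default HiveServer2 port:

```
jdbc:hive2://hiveserver-host:10000/default
```

The beeline command-line client bundled with Hive connects using the same kind of URL, for example: beeline -u "jdbc:hive2://hiveserver-host:10000/default".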

So, every Hive query always begins its journey from the User Interface.

2. Driver

The Driver is responsible for managing the lifecycle of a query and acts as a bridge between the user interface and the internal processing components of Hive.

Responsibilities of Driver

  • The driver receives the query submitted by the user through the user interface and starts the processing workflow.
  • It creates a session handle for the query and exposes execute and fetch APIs modeled on JDBC/ODBC interfaces, which are used to handle query execution and retrieve results.
  • It coordinates the entire execution process and ensures that each component is invoked in the correct sequence.
  • With the help of the compiler, it gets the Hive query converted into a MapReduce job (or, in later Hive versions, a Tez or Spark job), because internally Hive executes queries using distributed processing.

You can think of the driver as a central coordinator that ensures the query moves correctly through all stages of execution.

3. Compiler

The Compiler plays a crucial role in analyzing and transforming the Hive query into an executable plan.

Responsibilities of Compiler

  • The compiler first parses the query and breaks it down into an internal representation of smaller logical steps.
  • It then performs semantic analysis, which means it checks whether the query is logically correct, such as verifying that the referenced table names and column names actually exist.
  • Finally, it generates an execution plan, which is a step-by-step representation (a DAG of stages) of how the query will be executed.

While generating this execution plan, the compiler heavily depends on the Metastore to retrieve structural information about the data.

4. Metastore

The Metastore is a central repository that stores all the metadata information related to Hive tables and their structure.

Information Stored in Metastore

  • Database and table names
  • Column names and their data types
  • Partition details
  • Serialization and deserialization (SerDe) information
  • The HDFS location where each table's data is stored
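You can inspect much of this metadata yourself with a standard HiveQL command (assuming a table named table_1 exists):

```sql
-- Prints the Metastore's view of the table: columns and types,
-- HDFS location, SerDe, partitioning, and table parameters.
DESCRIBE FORMATTED table_1;
```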

Default vs Real-Time Usage

  • By default, Hive uses an embedded Derby database as its Metastore. Derby is lightweight, but in embedded mode it has a major limitation: only a single session can connect to it at a time.
  • Because of this limitation, Derby is not suitable for real-world environments where multiple users need to access Hive simultaneously.
  • In production environments, databases like MySQL or PostgreSQL are used as Metastore because they support multiple concurrent connections.

The Metastore is extremely important because it provides the structural context required for query validation and execution planning.
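To make this concrete, here is a minimal sketch of how a MySQL-backed Metastore is typically configured in hive-site.xml. The host name, database name, and credentials are placeholders; the property names are Hive's standard Metastore connection settings:

```xml
<!-- hive-site.xml: point the Metastore at an external MySQL database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive-password</value>
</property>
```

With this in place, multiple Hive clients can read and write metadata concurrently, which is what the embedded Derby setup cannot offer.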

5. Execution Engine

The Execution Engine is the component that actually executes the query by interacting with the Hadoop framework.

Responsibilities of Execution Engine

  • It receives the execution plan generated by the compiler.
  • It communicates with Hadoop components such as the NameNode (to locate data in HDFS) and the YARN ResourceManager (to acquire resources for the job).
  • It executes the query in the form of distributed processing tasks (such as MapReduce).
  • After execution, it sends the results back to the user interface through the driver.

This is the stage where the actual data processing takes place and results are generated.

Query Execution Flow in Hive

Now that we understand all the components, let us see how a Hive query flows through the architecture step by step.

Example Query

SELECT MAX(price) AS maximum_price FROM table_1;
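For context, assume table_1 was created with a schema like the following. This DDL is a hypothetical example chosen so that the price column referenced by the query exists:

```sql
-- Hypothetical schema for the table_1 used in this walkthrough
CREATE TABLE table_1 (
  item_id   INT,
  item_name STRING,
  price     DOUBLE
)
STORED AS TEXTFILE;
```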

Step 1: Query Submission

The user submits the query through the User Interface, such as the Hive CLI or any application connected via JDBC/ODBC.

Step 2: Driver Processing

The driver receives the query, creates a session handle for it, and forwards the query to the compiler for further processing.

Step 3: Compilation Phase

The compiler performs semantic analysis to ensure that the query is valid, such as checking whether table_1 exists and whether the column price is present.

Step 4: Metastore Interaction

During compilation, the compiler interacts with the Metastore to retrieve structural information about table_1, including column details and data types.

This information is essential for creating a correct execution plan.

Step 5: Execution Plan Creation

After validation and analysis, the compiler generates an execution plan that defines how the query will be executed step by step.
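You can ask Hive to print this plan instead of running the query by prefixing the query with EXPLAIN; the exact output varies with the Hive version and the configured execution engine:

```sql
-- Shows the stage DAG Hive would execute, without running the query
EXPLAIN SELECT MAX(price) AS maximum_price FROM table_1;
```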

Step 6: Execution Engine Processing

The execution plan is passed to the Execution Engine, which interacts with Hadoop components to locate data in HDFS and execute the required operations to compute the maximum value of the price column.

Step 7: Result Return

Once the Hadoop job completes execution, the result is sent back through the Execution Engine and Driver to the User Interface, where it is displayed to the user.

Final Summary

The complete flow of a Hive query can be summarized as a pipeline in which the query moves from User Interface → Driver → Compiler (which consults the Metastore) → Execution Engine → Hadoop, with the results flowing back to the user along the same path.

Each component plays a specific role, and together they enable Hive to process large-scale data efficiently using distributed computing.