Learnitweb

The Three Pillars of Observability: Logs, Metrics, and Traces

In the previous tutorial, we defined observability as the ability to understand the internal state of a system by analyzing the outputs that the application produces. These outputs are often called signals or telemetry signals, because they continuously describe what the system is doing and how it is behaving from the inside.

In modern observability practices, these signals are grouped into three fundamental categories, which are commonly referred to as the three pillars of observability:

  • Metrics
  • Logs
  • Traces

Each of these signals looks at the system from a different angle, and each of them answers a different type of question. When we use all three together, we get a much more complete and reliable understanding of what is really happening inside a complex system.

If these concepts do not feel completely clear right now, that is perfectly fine, because we will discuss each of them in much greater detail later in the course. For now, the goal is to build a strong intuitive understanding of what each signal represents and why it exists.

1. Metrics: Understanding Trends and System Health Over Time

Metrics are numeric measurements collected at regular intervals from an application or system. Each metric represents a value at a specific point in time, and when we store these values over a long period, we can start to observe trends, patterns, and anomalies in the behavior of the system.

Some important characteristics of metrics are:

  • Metrics are quantitative and time-based.
    They always carry a number and a timestamp, which makes them ideal for drawing graphs, setting thresholds, and watching how a system behaves over minutes, hours, days, or even months.
  • Metrics are not limited to infrastructure data like CPU and memory.
    Of course, we can collect standard metrics such as CPU usage, memory usage, disk I/O, request rate, or error rate, and these are extremely useful for understanding the health of the system. However, metrics can also be application-specific and business-specific.
  • Application-specific metrics often tell a much richer story.
    For example, you might collect metrics such as:
    • Number of orders placed per minute
    • Number of items sold
    • Number of orders cancelled
    • Available inventory
    • Number of threads in waiting state or running state
    These metrics are not just technical; they also describe how the business and the application are behaving.
  • Metrics are especially good for dashboards, alerts, and long-term monitoring.
    Because metrics are numerical and structured, they are perfect for creating dashboards and for defining alert rules such as “alert me if CPU usage goes above 90%” or “alert me if error rate goes above 5%”.

In short, metrics answer questions like:

Is the system healthy? Is it getting slower over time? Is the load increasing? Is something crossing a dangerous threshold?
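The core idea here, a numeric sample with a timestamp, plus an alert rule over the latest value, can be sketched in a few lines of Python. This is a toy illustration, not the API of any real monitoring system; the `Metric` class, `record`, and `check_alert` names are all made up for the example:

```python
import time

class Metric:
    """A minimal time-series metric: a named list of (timestamp, value) samples."""
    def __init__(self, name):
        self.name = name
        self.samples = []  # each entry is (unix_timestamp, numeric_value)

    def record(self, value):
        # Every sample gets a number and a timestamp, the two defining
        # properties of a metric.
        self.samples.append((time.time(), value))

    def latest(self):
        return self.samples[-1][1]

def check_alert(metric, threshold):
    """An alert rule like 'alert me if CPU usage goes above 90%'."""
    return metric.latest() > threshold

cpu = Metric("cpu_usage_percent")
cpu.record(42.0)
cpu.record(95.5)  # a spike

if check_alert(cpu, 90.0):
    print(f"ALERT: {cpu.name} is above 90% (currently {cpu.latest()}%)")
```

A real system (Prometheus, Micrometer, and so on) stores these samples in a time-series database and evaluates alert rules continuously, but the shape of the data, number plus timestamp, is exactly this.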

2. Logs: Understanding What Exactly Happened

Logs are timestamped records of events that happened inside the system. You can think of logs as the diary of the application, where the application writes down important things that it is doing, things that went wrong, and sometimes even detailed diagnostic information.

Some important characteristics of logs are:

  • Logs are usually human-readable and descriptive.
    Unlike metrics, which are just numbers, logs usually contain textual information that explains what exactly happened, in which part of the code, and often with what data.
  • Logs are extremely useful for debugging specific problems.
    For example, if a payment fails, a well-written log might tell you:
    • The exact time the failure happened
    • Which part of the code failed
    • What exception was thrown
    • And sometimes even the full stack trace
    This kind of information is invaluable when you are trying to understand why something broke.
  • Logs are event-focused rather than trend-focused.
    While metrics are great for seeing long-term trends, logs are much better at answering questions like: What exactly happened during this particular request? What error occurred? What path did the code take?

In simple terms, logs answer questions like:

What happened? Where did it happen? What was the error or event?
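The payment-failure example above can be reproduced with Python's standard `logging` module: `logger.exception` records the timestamp, the message, and the full stack trace in one call. The `PaymentError` class, the service name, and the order id are invented for illustration:

```python
import logging

# Timestamp, level, logger name, and message in every line.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("payment-service")

class PaymentError(Exception):
    pass

def charge_card(order_id, amount):
    # Hypothetical failure, just to demonstrate the log output.
    raise PaymentError(f"card declined for order {order_id}")

try:
    charge_card(order_id=1234, amount=49.99)
except PaymentError:
    # logger.exception logs at ERROR level and automatically appends
    # the exception type and the full stack trace.
    logger.exception("Payment failed for order 1234")
```

The resulting log line answers exactly the questions listed above: when it happened (the timestamp), where (the logger name and stack trace), and what the error was (the exception and message).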

3. Traces: Understanding the End-to-End Journey of a Request

Traces represent the complete end-to-end journey of a single request as it flows through a distributed system. This is especially important in microservices architectures, where a single user action might involve many services talking to each other.

Some important characteristics of traces are:

  • A trace shows the sequence of operations involved in handling one request.
    It does not just show the total time taken, but also shows which service called which other service, in what order, and how much time each step took.
  • Traces are critical for performance analysis in distributed systems.
    In a microservices environment, it is very common that a slow response is not caused by just one service, but by a chain of calls, where one slow dependency slows down everything else.

Let us take a concrete example.

Imagine you send a request like this:

POST /orders

Suppose the total response time for this request is 400 milliseconds. If you only look at this number, you still do not know why it took 400 milliseconds.

With tracing, you might discover something like this:

  • The Order Service receives the request.
  • It calls the Product Service to fetch product information.
    • The Product Service executes a database query that takes 70 ms.
  • Then the Order Service calls the Inventory Service.
    • That call takes 10 ms.
  • Then the Order Service calls the Payment Service.
    • That call takes 120 ms.
  • Finally, the Order Service processes the result and sends the response, which accounts for the remaining time (about 200 ms in this example).

Now you can clearly see:

  • The sequence of operations
  • The time spent in each service
  • And which part is contributing most to the total latency

Traces also carry status and error information. For example, you might see that:

  • One service returned 200 OK
  • Another service returned 500 Internal Server Error

This makes it possible not only to understand why a request is slow, but also to see exactly where it failed when something went wrong.
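The trace walkthrough above is really just a tree of spans, where each span has a name and a duration. Real tracing libraries such as OpenTelemetry build this structure automatically by propagating a trace id across services, but a hand-rolled sketch (with the service names and timings taken straight from the example) shows the idea:

```python
# Each span: (operation name, duration in ms), mirroring the example above.
# The first entry is the root span covering the whole request.
spans = [
    ("order-service POST /orders", 400),
    ("product-service db query",    70),
    ("inventory-service call",      10),
    ("payment-service call",       120),
]

root_name, total_ms = spans[0]
children = spans[1:]

# Time attributed to downstream calls vs the Order Service itself.
downstream_ms = sum(ms for _, ms in children)
own_ms = total_ms - downstream_ms

print(f"{root_name}: {total_ms} ms total")
for name, ms in children:
    print(f"  {name}: {ms} ms ({ms / total_ms:.0%} of total)")
print(f"  order-service own processing: {own_ms} ms")

# The slowest child span is the biggest single contributor to latency.
slowest = max(children, key=lambda span: span[1])
print(f"Biggest contributor: {slowest[0]} ({slowest[1]} ms)")
```

Even this toy version answers the tracing questions: the list gives the sequence of operations, the durations show where the time went, and the slowest span points at the bottleneck (here, the 120 ms payment call).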

In simple terms, traces answer questions like:

How did this request move through the system? Where did it spend time? Where did it fail?

Why We Need All Three Together

Each of these signals looks at the system from a different perspective:

  • Metrics are excellent for health, trends, and alerting.
  • Logs are excellent for detailed event-level debugging.
  • Traces are excellent for understanding request flows and performance bottlenecks in distributed systems.

On their own, each signal is useful, but none of them is sufficient by itself. When you use metrics, logs, and traces together, you get a complete and connected picture of how the system is behaving, both at a high level and at a very detailed level.