Comprehensive Observability using Prometheus, Zipkin & Grafana


Published
July 24, 2025
Author
Chandra Wijaya

What is O11y?

O11y stands for observability: the ability to understand what is happening inside a system or application by examining its outputs, such as data, logs, and metrics. It enables not only engineers but also other IT teams, such as DevOps, SRE, or even standby system support, to understand the behavior of a system or app, to know what went wrong (if something does), why it went that way, to find the root cause, and to act on it. O11y is similar to monitoring but takes it to a higher level, for the reasons below.
Focus
Monitoring tracks predefined metrics and alerts on known issues, while o11y focuses on understanding the state of the system by analyzing its outputs. It is often said that monitoring deals with “known unknowns” while observability deals with “unknown unknowns”.
Approach
Monitoring is a “reactive” approach: we act when something goes wrong, based on pre-set thresholds. Observability is a more “proactive” approach: it often incorporates extra situational and historical data to analyze the root cause, the “why”, behind a monitoring alert.
Look at it this way: monitoring is like a car dashboard. It shows us basic information about the car, like the fuel level and how fast we are going, and in modern cars even more, like tire pressure, temperature, GPS, and so on. Observability is the car mechanic. Say that one day while driving you feel something unusual with your car, and it shows up on the dashboard. Oh, you see an alert. That’s monitoring. Then you take your car to the service center to have it checked. You tell the mechanic what happened. The mechanic then uses that information to diagnose what is happening and how to fix it (by checking your car, of course). Now that is observability.
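To make the “pre-set threshold” side of this comparison concrete, here is a minimal sketch in Python. The class name `ThresholdRule`, the metric names `cpu_percent` and `error_rate`, and the limits are all hypothetical, chosen only for illustration; real monitoring tools express such rules in their own configuration formats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThresholdRule:
    """A pre-set rule: alert when a metric goes above a fixed limit."""
    metric: str
    limit: float

    def check(self, reading: dict) -> Optional[str]:
        # Return an alert message if the metric is present and over the limit.
        value = reading.get(self.metric)
        if value is not None and value > self.limit:
            return f"ALERT {self.metric}={value} (limit {self.limit})"
        return None


# Hypothetical rules and reading, chosen only for illustration.
rules = [ThresholdRule("cpu_percent", 90.0), ThresholdRule("error_rate", 0.05)]
reading = {"cpu_percent": 97.5, "error_rate": 0.01}

alerts = [msg for rule in rules if (msg := rule.check(reading)) is not None]
print(alerts)
```

The alert says which number crossed which line, much like the dashboard light. It says nothing about why, which is the part the mechanic, or observability, has to work out from additional context.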

Why it matters

First, user experience. Let me make this short: if you build a system or application and you don’t care about your users, then you probably don’t need observability, or even monitoring at all. It may sound stupid, but rationally, if you make something, you want someone else to use it, to find it useful, and to want to keep using it, right? So it definitely doesn’t make sense not to care about users. And that extends to how users feel when using the product and what makes them want to use it.
I’m not talking about the business value the product offers, like “Oh, it’s a banking app, so people would want to use it for financial purposes: saving money, investments, credit cards, or applying for loans”. But if that banking app’s performance sucks, like “fund transfers are slow, payments at the supermarket fail with unknown errors, account balance inquiries are unreliable”, and other banks offer more reliable apps as alternatives, which one do you think people will use?
This is a rhetorical question; the answer is obvious. Ensuring a good user experience in our application is crucial. It is a life-or-death matter for the business itself. As a result, IT teams face mounting pressure to track and respond to issues much faster. This is where observability comes into play.
Second, tech evolves dramatically. As organizations embrace concepts like distributed systems, cloud, DevOps, and so on, system architectures have dramatically increased in complexity and scale. Simple systems have fewer moving parts, making them easier to manage. Monitoring CPU, memory, databases, and networking conditions is usually enough to understand these systems and apply the appropriate fix to a problem.
Distributed systems have a far higher number of interconnected parts, so the number and types of failures that can occur are higher too. Additionally, distributed systems are constantly updated, and every change can create a new type of failure. In a distributed environment, understanding a current problem is an enormous challenge, largely because such an environment produces more “unknown unknowns” than simpler systems.

How to become observable

 

3 Pillars of Observability

 

Observability Tech

 

The distinction between metrics and tracing

By metrics, we specifically mean the class of information that allows you to reason about the performance of a system in the aggregate (across different components of a single app, instances of an app in a cluster, clusters operating in different environments or regions, etc.).
⚠️
Notably this excludes information intended to reason about the contribution of various components to the total latency of a single request as it passes through a series of services; this is the responsibility of distributed tracing collectors like Spring Cloud Sleuth, Zipkin’s Brave, etc.
Distributed tracing systems provide detailed information about subsystem latency, but generally downsample in order to scale (e.g. Spring Cloud Sleuth ships 10% of samples by default). Metrics data is generally pre-aggregated and so naturally lacks correlative information, but is also not downsampled. So, for a series of 100,000 requests in a minute interval that feature an interaction with service A and, depending on the input, maybe an interaction with service B:
  1. Metrics data will tell you that in the aggregate, service A’s observed throughput was 100k requests and service B’s observed throughput was 60k requests. Additionally, in that minute, service A’s max overall average latency was 100ms and service B’s max overall average latency was 50ms. It will also provide information on maximum latencies and other distribution statistics in that period.
  2. A distributed tracing system will tell you that for a particular request (but not the entire population of requests, because remember downsampling is happening), service A took 50ms and service B took 90ms.
You might reasonably infer from the metrics data that roughly half the time spent in the worst-case user experience was spent in each of A and B, but you can’t be certain since you are looking at an aggregate, and it is entirely possible that in the worst case all 100ms was spent in service A and B was never called at all.
Conversely, from tracing data you cannot reason about throughput over an interval or the worst-case user experience.
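The scenario above can be played out in a small simulation. This is a minimal sketch in Python, not how Micrometer, Sleuth, or Zipkin are implemented. Only the request count and the 10% sampling rate come from the text; the latency ranges, the 60% chance of calling service B, the fixed seed, and the names `ServiceMetrics` and `simulate` are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class ServiceMetrics:
    """Pre-aggregated view: one counter and one running max per service."""
    count: int = 0
    max_ms: float = 0.0

    def record(self, duration_ms: float) -> None:
        self.count += 1
        self.max_ms = max(self.max_ms, duration_ms)


def simulate(total_requests: int = 100_000, sample_rate: float = 0.10, seed: int = 42):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    metrics = {"A": ServiceMetrics(), "B": ServiceMetrics()}
    traces = []  # sampled per-request breakdowns

    for request_id in range(total_requests):
        # Assumed latencies: every request calls A, about 60% also call B.
        spans = {"A": rng.uniform(5, 100)}
        if rng.random() < 0.6:
            spans["B"] = rng.uniform(5, 50)

        # Metrics: every call is counted, but the request identity is dropped.
        for service, duration_ms in spans.items():
            metrics[service].record(duration_ms)

        # Tracing: the full breakdown is kept, but only for a sample.
        if rng.random() < sample_rate:
            traces.append({"request_id": request_id, "spans": spans})

    return metrics, traces


metrics, traces = simulate()
print("A throughput:", metrics["A"].count, "max ms:", round(metrics["A"].max_ms, 1))
print("B throughput:", metrics["B"].count, "max ms:", round(metrics["B"].max_ms, 1))
print("traces kept:", len(traces), "of", metrics["A"].count)
```

The `metrics` dictionary knows the exact throughput and the maximum latency of each service, but it cannot say whether the slowest call to A and the slowest call to B belonged to the same request. Each entry in `traces` has that per-request breakdown, but because only a sample is kept, the slowest request may not be among them and the trace count says nothing reliable about throughput.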
 

Understanding Prometheus Metrics (label & attr)

 

Email Alert Scenario

 
 

Zipkin

What is Trace & Spans

 
Not useful in a monolithic application? THINK AGAIN! Aren’t we all using APIs over HTTP at least? Then it is important to monitor those calls as well.
 

Support scenario IRL

How can we utilize these technologies to solve production problems?
 

References

How to Monitor Spring Boot Application With Prometheus and Grafana