Comprehensive O11y using Prometheus, Zipkin & Grafana

Published: July 24, 2025
Tags: Tech, Web Dev
Author: Chandra Wijaya

The 3 AM Wake-Up Call That Changed Everything

So, I’m gonna start with a story.
It was 3:47 AM when Sarah's phone erupted with alerts. As the lead engineer at FinTech Solutions, she'd seen her share of production incidents, but this was different. Their payment processing system—handling millions of dollars in transactions daily—was experiencing mysterious slowdowns. Customer complaints were flooding in, and every minute of downtime meant lost revenue and eroding trust.
Sarah's team scrambled to diagnose the issue. They checked server health: CPU and memory looked normal. Database queries? Running fine. Network connectivity? All green. Yet customers were timing out during checkout, and the team was flying blind.
After two hours of frantic investigation and a $2.3 million revenue loss, they finally discovered the culprit: a third-party payment gateway had introduced subtle latency in their API responses. But here's the kicker—this could have been caught in minutes, not hours, if they had proper observability in place.
Six Months Later: A Different Story
Fast forward to a similar scenario. At 2:15 AM, the system detected an anomaly. But this time, the on-call engineer, Marcus, received a contextualized alert with the exact service endpoint, and even the suspected root cause. He opened his laptop to find:
  • Distributed traces: showing the payment gateway's response time had jumped from 200ms to 8 seconds
  • Correlated logs: revealing specific transaction IDs affected
  • Metrics dashboards: displaying the cascade effect across dependent services
  • Automatic rollback: already triggered based on predefined SLOs
Marcus confirmed the automated mitigation, notified the gateway provider, and the incident was resolved in 12 minutes. Most customers never even noticed. Total revenue impact? Less than $50,000, and zero customer complaints.
 
The Real Cost of Flying Blind
Sarah's story isn't unique. Across the industry, engineering teams face similar challenges. And these weren't failures of engineering talent. They were failures of visibility.

Let’s talk about O11y.

O11y stands for observability (not to be confused with Observable). It refers to the ability to understand what is happening in our system or application by examining the system's outputs, such as logs, metrics, and traces. It enables not only SWEs but also other IT teams, like DevOps, SRE, or even standby system support, to understand the behavior of a system or app: to know what went wrong (when it does), why it went that way, find the root cause, and act on it.

“We already implemented monitoring, with a good alerting system.”

Good! But observability takes monitoring to a higher level, even though the two are closely related.
Focus: Monitoring tracks predefined metrics and alerts on known issues, while o11y focuses on understanding the state of the system by analyzing its outputs. It is often said that monitoring deals with “known unknowns” while observability deals with “unknown unknowns”.
Approach: Monitoring is a “reactive” approach: we act when something goes wrong based on pre-set thresholds. Observability is more of a “proactive” approach, often incorporating extra situational data and historical data to analyze the root cause of why those monitoring alerts occurred.
Look at it this way: monitoring is like a car dashboard. It shows us basic information about the car, like the fuel level and how fast we are going; a modern car tells us even more, like tire pressure, temperature, GPS, and so on. Observability is more like a car mechanic. Say that one day while driving you feel something unusual with your car, and an alert shows up on the dashboard. That's monitoring. Then you take your car to the service center to have it checked and tell the mechanic what happened. The mechanic uses the information you gave him to diagnose what is happening and how to fix it (by checking your car, of course). Now, that is observability.

Why it matters

I believe Sarah's and Marcus's stories already say a lot about how important observability is.
First, user experience. Let me keep this short: if you build a system or application and you don't care about your users, then you probably don't need observability; you don't even need users at all. It may sound blunt, but rationally, if you build something, you want someone else to use it, to feel its usefulness, to keep wanting to use your product, right? So it definitely makes no sense not to care about users. And that extends to how users feel while using the product, and how to make them want to keep using it.
I'm not talking about the business value the product offers, such as: “Oh, it's a banking app, people will use it for financial purposes: saving money, investments, credit cards, or applying for loans. People need it.”
True, but I'm talking more about the “reliability” side. If that banking app's performance sucks, with slow fund transfers, unknown errors when making payments at the supermarket, or unreliable balance inquiries, and other banks offer more reliable apps as alternatives, which one do you think people will use? This is a rhetorical question, and the answer is obvious. It is a hell of a competition!
"Every minute of downtime isn't just lost revenue—it's lost trust." Our competitors were one click away. Customers who experienced failures rarely came back!
I hope the issue is clear now. For the business, it is a matter of life and death. As a result, IT teams face mounting pressure to track and respond to issues at a much faster pace. This is where observability comes into play: it helps us, the “IT”, ensure users get the most reliable experience using our apps!
Second, we all know tech evolves dramatically. As organizations embrace new concepts like distributed systems, cloud, DevOps, and so on, system architectures have dramatically increased in complexity and scale. Simple systems have fewer moving parts, making them easier to manage. Monitoring CPU, memory, databases and network conditions is usually enough to understand these systems and apply the appropriate fix to a problem.
Distributed systems have a far higher number of interconnected parts, so the number and types of failures that can occur is higher too. Additionally, distributed systems are constantly updated, and every change can create a new type of failure. In a distributed environment, understanding a current problem is an enormous challenge, largely because it produces more “unknown unknowns” than simpler systems.
"Code that works on your laptop isn't the same as code serving 10 million users." Production environments are complex, distributed, and unpredictable. Without observability, you're deploying into darkness.

Got it, so how can I become observable?

We must know the three main focuses of o11y:

3 Pillars of Observability

Logs
I have a blog about this: Better Support Experience with Meaningful Logging. Basically, logs are records of events that happened in our system, along with additional context: when it happened, what was sent or returned (payload), who triggered it, and whatever else was added. Typically, logs are the first place we look when something is wrong.
We can say that logs are the "why" it's happening (detailed event records, error messages, transaction flows).
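To make the "when, what, who" idea concrete, here is a minimal sketch in plain Java of a structured log entry. The names and format are purely illustrative, not from any logging framework:

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: a structured log entry capturing when, what, and who.
// In practice a logging framework would do this for you.
public class LogEntrySketch {
    static String structuredLog(String level, String event, String userId, String payload) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("timestamp", Instant.now().toString()); // when it happened
        fields.put("level", level);
        fields.put("event", event);     // what happened
        fields.put("user", userId);     // who triggered it
        fields.put("payload", payload); // what was sent or returned
        StringBuilder sb = new StringBuilder("{");
        fields.forEach((k, v) -> sb.append('"').append(k).append("\":\"").append(v).append("\","));
        sb.setLength(sb.length() - 1);  // drop trailing comma
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        System.out.println(structuredLog("ERROR", "payment.timeout", "user-42", "txn-981"));
    }
}
```

A machine-parseable entry like this is what later lets tools correlate logs with traces and metrics, instead of grepping free-form text.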
Traces
To put it simply, this is the "where" it's happening (request journeys across distributed services). Traces are like footprints: they tell the journey of a transaction or request. In a distributed system, where logs are scattered across microservices, or even other systems out of our reach, traces become crucial for reconstructing the whole picture of how a request moved through those systems. Another analogy: a trace is like scratches on tree bark when we go camping. The marks may be scattered across trees, but as long as each scratch can be recognized as our "sign", for example an arrow shape drawn on each tree, we can always find our way back.
In system terms, a trace is the story from when the user starts interacting with the UI of our app (clicking buttons, signing in, making transactions) to how that interaction is handled by the entire architecture: DNS, load balancers, reverse proxies, backend APIs, third-party APIs, databases, and so on.
Metrics
Metrics are the numeric measurements we use to quantify the performance of a service: for example, how much memory and CPU the application uses over 15 minutes, how much latency it experiences during a usage spike, or how long each node of the system takes to process a request.
This is the "what" is happening (request rates, error rates, latency percentiles).
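As a tiny illustration of "latency percentiles", here is a sketch of computing a p95 from raw samples. (Metrics systems like Prometheus pre-aggregate numbers like this, e.g. via histograms, so we rarely compute them by hand; this just shows what the number means.)

```java
import java.util.Arrays;

// Sketch: the p-th percentile of raw latency samples using the
// nearest-rank method. Note how one slow outlier dominates the p95.
public class PercentileSketch {
    static long percentile(long[] latenciesMs, double p) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        long[] samples = {120, 80, 200, 95, 8000, 110, 130, 90, 100, 105};
        System.out.println("p50 = " + percentile(samples, 50) + "ms"); // p50 = 105ms
        System.out.println("p95 = " + percentile(samples, 95) + "ms"); // p95 = 8000ms
    }
}
```

This is why percentiles beat averages for spotting trouble: the average of these samples hides the fact that some users waited 8 seconds.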
 
“OK, my apps have all of them. Am I observable now?”
Could be. But having these three things alone doesn't guarantee observability if you don't know how to use them smartly. Imagine working with all that data scattered all over the place, each piece on its own? Come on… we need proper tools, my friend. Tools that integrate all of them into one complete solution. I'm not going to sell you anything here, but there are a lot of options to choose from to finally gain observability, which I'll cover in a bit.
Observability system
Obviously, we can always build that tool ourselves, but I would recommend not going down that road. There is a lot of open source software for this purpose, with features as complete and rich as we can imagine, and if we need enterprise support we can always buy commercially licensed software from world-renowned companies. Typically, our observability system needs to provide these capabilities:
  • Data collection
    • Having all the system outputs collected in one place is what makes observability possible. The system should facilitate the collection and aggregation of, and access to, CPU and memory data, app logs, availability numbers, average latency and other metrics.
  • Monitoring
    • I mentioned that monitoring and observability do not exclude each other. In fact, we cannot have observability without monitoring. Teams must be able to view app and system data with ease, so observability tools set up dashboards to monitor application health, related services and relevant business objectives (often shown as graphs, gauges, charts…).
  • Analysis
    • The main way to understand all the data we've collected is by analyzing it through numbers, dashboards and visualizations, backed by complete traces. I said it: complete traces. You might have heard of APM (application performance monitoring), or even used it in your apps. It is one place to monitor predefined metrics of your app, like periodic health checks, CPU & memory capacity and usage, the number of open DB connections, connection pooling, error rates, etc.
      Nowadays, we are shifting to modern development practices; if you haven't dug into distributed systems just yet, agile development, CI/CD, DevOps, containerization, Kubernetes and cloud platforms will blow your mind. A traditional monitoring system cannot keep pace. This is where APM and observability systems differ (more on that here and here). In short, we need an analytics system able to create and store real-time, high-fidelity, context-rich, fully correlated records of every application, user request and data transaction on the network.
  • AI (AIOps).
    • This is purely optional, but who isn't selling AI nowadays? AI-driven observability enables development teams to proactively protect enterprise IT infrastructure instead of solving problems as they arise. Using ML algorithms, observability tools can parse extensive data streams to find patterns, trends and anomalies, revealing insights that a human might overlook.

Observability Tech

Okay, so now let's talk about how we implement observability in code. I cover only three products that are commonly used, and I intend to deliver just the concepts, the importance, and the awareness of why this matters. These products are just alternatives; there are a lot of observability technologies out there, from OSS to enterprise ones.

Prometheus

It is an open source systems monitoring and alerting toolkit. Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.
Prometheus works well for recording any purely numeric time series. It fits both machine-centric monitoring as well as monitoring of highly dynamic service-oriented architectures. In a world of microservices, its support for multi-dimensional data collection and querying is a particular strength.
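To sketch how Prometheus is pointed at an application, here is a minimal `prometheus.yml`. The job name, target, and metrics path are assumptions for a Spring Boot app exposing a Prometheus endpoint via Actuator; adjust them to your setup:

```yaml
global:
  scrape_interval: 15s   # how often Prometheus pulls metrics

scrape_configs:
  - job_name: "payment-service"          # illustrative job name
    metrics_path: "/actuator/prometheus" # typical Spring Boot Actuator path
    static_configs:
      - targets: ["localhost:8080"]      # assumed app host:port
```

Note the pull model: Prometheus scrapes the app's endpoint on an interval, rather than the app pushing metrics out.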

Zipkin

This is magic. Zipkin is a distributed tracing system. It helps gather the timing data needed to troubleshoot latency problems, especially in service architectures. Zipkin automatically includes two IDs that can be used for tracing: a Trace ID and a Span ID. With these two, we can collect the necessary data about requests. More interestingly, it can summarize them for you, such as the percentage of time spent in a service and whether or not operations failed.
What are Traces & Spans
In my previous blog, Better Support Experience with Meaningful Logging, I gave a tip to add a correlationID or requestID to our logs. With Zipkin, we don't need to add these IDs manually; Zipkin provides a traceID and spanID for our logs automatically. These two IDs are used as follows.
Traces and Spans indicate the journey of the request and each event occurred during the process.
A traceID uniquely identifies an entire request's journey across all microservices, while a spanID identifies one step: an individual unit of work (a single operation or hop) within that trace. Every span in a request shares the same traceID but holds a unique spanID. In addition, a span also contains a parentSpanID, which allows Zipkin to reconstruct the structure of the events within the trace.
With these IDs added by Zipkin, we can later see a comprehensive visualization in Zipkin UI (or any other dashboard you prefer) which will tell us about the status, the duration, the errors of a transaction (or request) in a whole complete story. A good log is like a fairytale, remember? 🙂
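The traceID/spanID/parentSpanID relationship can be sketched in a few lines of plain Java. The names and structure here are illustrative; real instrumentation (e.g. Zipkin's Brave) generates and propagates these IDs for you:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Sketch of the trace/span model: every span in one request shares a
// traceId; each span has its own spanId and a parentSpanId pointing
// back at the span that called it.
public class TraceSketch {
    static class Span {
        final String traceId, spanId, parentSpanId, name;
        Span(String traceId, String spanId, String parentSpanId, String name) {
            this.traceId = traceId; this.spanId = spanId;
            this.parentSpanId = parentSpanId; this.name = name;
        }
    }

    static final List<Span> collected = new ArrayList<>();

    // A root span starts a new trace; child spans inherit the traceId.
    static Span startSpan(Span parent, String name) {
        String traceId = (parent == null) ? newId() : parent.traceId;
        String parentId = (parent == null) ? null : parent.spanId;
        Span span = new Span(traceId, newId(), parentId, name);
        collected.add(span);
        return span;
    }

    static String newId() {
        return UUID.randomUUID().toString().replace("-", "").substring(0, 16);
    }

    public static void main(String[] args) {
        Span checkout = startSpan(null, "checkout");        // user clicks "pay"
        Span api      = startSpan(checkout, "payment-api"); // backend hop
        Span gateway  = startSpan(api, "payment-gateway");  // third-party hop
        for (Span s : collected) {
            System.out.println(s.name + " trace=" + s.traceId
                    + " span=" + s.spanId + " parent=" + s.parentSpanId);
        }
    }
}
```

Because all three spans print the same traceId, a collector can stitch them back into one tree, with parentSpanId giving the call structure.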
You might think: “Ah, my application is monolithic. I don’t need complex tracing system like this.”
Think again. Aren’t we, at least, using one API through HTTP?
Metrics are a class of information that allows you to reason about the performance of a system in the aggregate (across different components of a single app, instances of an app in a cluster, clusters operating in different environments or regions, etc.).
⚠️
Notably this excludes information intended to reason about the contribution of various components to the total latency of a single request as it passes through a series of services; this is the responsibility of distributed tracing collectors like Spring Cloud Sleuth, Zipkin’s Brave, etc.
Distributed tracing systems provide detailed information about subsystem latency, but generally downsample in order to scale (e.g. Spring Cloud Sleuth ships 10% of samples by default). Metrics data is generally pre-aggregated and so naturally lacks correlative information, but is also not downsampled. So, for a series of 100,000 requests in a minute interval that feature an interaction with service A and, depending on the input, maybe an interaction with service B:
  1. Metrics data will tell you that in the aggregate, service A’s observed throughput was 100k requests and service B’s observed throughput was 60k requests. Additionally, in that minute, service A’s max overall average latency was 100ms and service B’s max overall average latency was 50ms. It will also provide information on maximum latencies and other distribution statistics in that period.
  2. A distributed tracing system will tell you that for a particular request (but not the entire population of requests, because remember downsampling is happening), service A took 50 ms and service B took 90ms.
You might reasonably infer from the metrics data that roughly half the time spent in the worst-case user experience was spent in each of A and B, but you can’t be certain since you are looking at an aggregate, and it is entirely possible that in the worst case all 100ms was spent in service A and B was never called at all.
Conversely, from tracing data you cannot reason about throughput over an interval or the worst-case user experience.

Grafana

We have the foundations, but we need visualization. Here comes Grafana. In short, it is an all-in-one analytics tool that integrates the data we've collected from Prometheus and Zipkin. And of course, we can create beautiful charts and informative dashboards that support data-driven decision making.
Grafana Dashboard
There are many use cases we can use Grafana for, such as data visualization (integrating, transforming, etc.), real-time monitoring, and even alerting. It has broad data source support, including Prometheus, Elasticsearch, SQL databases, and even Zipkin!
Also, we are dealing with collected data and logs that are all about time. Guess what? Grafana works natively with time series databases (TSDBs), so what else do we need? 😀 Seriously, if you are looking for an open source analytics tool, I would highly recommend Grafana. Learn more here.
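A Grafana panel over a Prometheus data source is ultimately just a PromQL query. As an example, assuming the app exposes Micrometer's `http_server_requests` metrics (as a Spring Boot app with Actuator typically would), a request-rate panel broken down by HTTP status might use:

```promql
# Requests per second over the last 5 minutes, grouped by HTTP status
sum(rate(http_server_requests_seconds_count[5m])) by (status)
```

Pair a query like this with a latency panel and an error-rate panel and you already have the skeleton of a service health dashboard.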

Let’s put them in action
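A quick way to try all three together is Docker Compose. Here is a minimal sketch: the images (`prom/prometheus`, `openzipkin/zipkin`, `grafana/grafana`) and default ports are the official ones, but treat the file layout and the mounted config path as assumptions to adapt to your project:

```yaml
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"   # Prometheus UI
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml  # your scrape config

  zipkin:
    image: openzipkin/zipkin
    ports:
      - "9411:9411"   # Zipkin UI and span collector endpoint

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"   # Grafana UI
    depends_on:
      - prometheus
      - zipkin
```

After `docker compose up`, point your instrumented app at Zipkin for spans, let Prometheus scrape its metrics endpoint, and add Prometheus and Zipkin as data sources in Grafana.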

 
 

References

How to Monitor Spring Boot Application With Prometheus and Grafana
In this video, we explore how to monitor a Spring Boot application using Prometheus and Grafana: metrics are gathered with Prometheus and visualized on a Grafana dashboard. Community dashboard: https://grafana.com/grafana/dashboards/6756. Companion article: https://refactorfirst.com/spring-boot-prometheus-grafana