Comprehensive O11y using Prometheus, Zipkin & Grafana

Published: July 24, 2025
Tags: Tech, Web Dev
Author: Chandra Wijaya

The 3 AM Wake-Up Call That Changed Everything

So, I’m gonna start with a story.
It was 3:47 AM when Sarah's phone erupted with alerts. As the lead engineer at FinTech Solutions, she'd seen her share of production incidents, but this was different. Their payment processing system—handling millions of dollars in transactions daily—was experiencing mysterious slowdowns. Customer complaints were flooding in, and every minute of downtime meant lost revenue and eroding trust.
Sarah's team scrambled to diagnose the issue. They checked server health: CPU and memory looked normal. Database queries? Running fine. Network connectivity? All green. Yet customers were timing out during checkout, and the team was flying blind.
After two hours of frantic investigation and a $2.3 million revenue loss, they finally discovered the culprit: a third-party payment gateway had introduced subtle latency in their API responses. But here's the kicker—this could have been caught in minutes, not hours, if they had proper observability in place.
Six Months Later: A Different Story
Fast forward to a similar scenario. At 2:15 AM, the system detected an anomaly. But this time, the on-call engineer, Marcus, received a contextualized alert with the exact service endpoint, and even the suspected root cause. He opened his laptop to find:
  • Distributed traces: showing the payment gateway's response time had jumped from 200ms to 8 seconds
  • Correlated logs: revealing specific transaction IDs affected
  • Metrics dashboards: displaying the cascade effect across dependent services
  • Automatic rollback: already triggered based on predefined SLOs
Marcus confirmed the automated mitigation, notified the gateway provider, and the incident was resolved in 12 minutes. Most customers never even noticed. Total revenue impact? Less than $50,000, and zero customer complaints.
 
The Real Cost of Flying Blind
Sarah's story isn't unique. Across the industry, engineering teams face similar challenges. And these weren't failures of engineering talent. They were failures of visibility.

Let’s talk about O11y.

O11y stands for observability (not to be confused with Observable). It refers to the ability to understand what is happening in our system or application by examining the system's outputs, such as logs, metrics, and traces. It enables not only SWEs but also other IT teams, like DevOps, SRE, or even standby system support, to understand the behavior of a system or app: to know what went wrong (when it does), why it went that way, find the root cause, and act on it.

“We already implemented monitoring, with a good alerting system.”

Good! But observability takes monitoring to a higher level, even though the two are closely related.
Focus: Monitoring tracks predefined metrics and alerts on known issues, while o11y focuses on understanding the state of the system by analyzing its outputs. It is often said that monitoring deals with “known unknowns” while observability deals with “unknown unknowns”.
Approach: Monitoring is a “reactive” approach: we act when something goes wrong based on pre-set thresholds. Observability is more of a “proactive” approach, often incorporating extra situational data and historical data to analyze the root cause of why those monitoring alerts occurred.
Look at it this way: monitoring is like a car dashboard. It shows us basic information about the car, like the fuel level and how fast we are going; a modern car tells us even more, like tire pressure, temperature, GPS, and so on. Observability is more like a car mechanic. Say that one day while driving you feel something unusual with your car, and an alert shows up on the dashboard. That's monitoring. Then you take your car to the service center to have it checked and tell the mechanic what happened. The mechanic uses the information you gave him to diagnose what is happening and how to fix it (by checking your car, of course). Now, that is observability.

Why it matters

I believe Sarah's and Marcus's stories already say a lot about how important observability is.
First, user experience. Let me keep this short: if you build a system or application and you don't care about your users, then you probably don't need observability; you don't even need users at all. It may sound blunt, but rationally, if you build something, you want someone else to use it, to feel its usefulness, to keep wanting to use your product, right? So it definitely makes no sense not to care about users. And that extends to how users feel while using the product, and how to make them want to keep using it.
I'm not talking about the business value the product offers, such as: “Oh, it's a banking app, people will use it for financial purposes: saving money, investments, credit cards, or applying for loans. People need it.”
True, but I'm talking more about the “reliability” side. If that banking app's performance sucks, with slow fund transfers, unknown errors when making payments at the supermarket, or unreliable balance inquiries, and other banks offer more reliable apps as alternatives, which one do you think people will use? This is a rhetorical question, and the answer is obvious. It is a hell of a competition!
"Every minute of downtime isn't just lost revenue—it's lost trust." Our competitors were one click away. Customers who experienced failures rarely came back!
I hope the issue is clear now. For the business, it is a matter of life and death. As a result, IT teams face mounting pressure to track and respond to issues at a much faster pace. This is where observability comes into play: it helps us, the “IT”, ensure users get the most reliable experience using our apps!
Second, we all know tech evolves dramatically. As organizations embrace new concepts like distributed systems, cloud, DevOps, and so on, system architectures have dramatically increased in complexity and scale. Simple systems have fewer moving parts, making them easier to manage. Monitoring CPU, memory, databases and network conditions is usually enough to understand these systems and apply the appropriate fix to a problem.
Distributed systems have a far higher number of interconnected parts, so the number and types of failures that can occur is higher too. Additionally, distributed systems are constantly updated, and every change can create a new type of failure. In a distributed environment, understanding a current problem is an enormous challenge, largely because it produces more “unknown unknowns” than simpler systems.
"Code that works on your laptop isn't the same as code serving 10 million users." Production environments are complex, distributed, and unpredictable. Without observability, you're deploying into darkness.

Got it, so how can I become observable?

We must know the three main focuses of o11y:

3 Pillars of Observability

Logs
I have a blog about this: Better Support Experience with Meaningful Logging. Basically, logs are records of events that happened in our system, along with additional context: when it happened, what was sent or returned (payload), who triggered it, and whatever else was added. Typically, logs are the first place we look when something is wrong.
We can say that logs are the "why" it's happening (detailed event records, error messages, transaction flows).
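To make the "when, what, who" idea concrete, here is a minimal sketch in plain Java of a structured log entry. The names and format are purely illustrative, not from any logging framework:

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: a structured log entry capturing when, what, and who.
// In practice a logging framework would do this for you.
public class LogEntrySketch {
    static String structuredLog(String level, String event, String userId, String payload) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("timestamp", Instant.now().toString()); // when it happened
        fields.put("level", level);
        fields.put("event", event);     // what happened
        fields.put("user", userId);     // who triggered it
        fields.put("payload", payload); // what was sent or returned
        StringBuilder sb = new StringBuilder("{");
        fields.forEach((k, v) -> sb.append('"').append(k).append("\":\"").append(v).append("\","));
        sb.setLength(sb.length() - 1);  // drop trailing comma
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        System.out.println(structuredLog("ERROR", "payment.timeout", "user-42", "txn-981"));
    }
}
```

A machine-parseable entry like this is what later lets tools correlate logs with traces and metrics, instead of grepping free-form text.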
Traces
To put it simply, this is the "where" it's happening (request journeys across distributed services). Traces are like footprints: they tell the journey of a transaction or request. In a distributed system, where logs are scattered across microservices, or even other systems out of our reach, traces become crucial for reconstructing the whole picture of how a request moved through those systems. Another analogy: a trace is like scratches on tree bark when we go camping. The marks may be scattered across trees, but as long as each scratch can be recognized as our "sign", for example an arrow shape drawn on each tree, we can always find our way back.
In system terms, a trace is the story from when the user starts interacting with the UI of our app (clicking buttons, signing in, making transactions) to how that interaction is handled by the entire architecture: DNS, load balancers, reverse proxies, backend APIs, third-party APIs, databases, and so on.
Metrics
Metrics are the numeric measurements we use to quantify the performance of a service: for example, how much memory and CPU the application uses over 15 minutes, how much latency it experiences during a usage spike, or how long each node of the system takes to process a request.
This is the "what" is happening (request rates, error rates, latency percentiles).
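As a tiny illustration of "latency percentiles", here is a sketch of computing a p95 from raw samples. (Metrics systems like Prometheus pre-aggregate numbers like this, e.g. via histograms, so we rarely compute them by hand; this just shows what the number means.)

```java
import java.util.Arrays;

// Sketch: the p-th percentile of raw latency samples using the
// nearest-rank method. Note how one slow outlier dominates the p95.
public class PercentileSketch {
    static long percentile(long[] latenciesMs, double p) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        long[] samples = {120, 80, 200, 95, 8000, 110, 130, 90, 100, 105};
        System.out.println("p50 = " + percentile(samples, 50) + "ms"); // p50 = 105ms
        System.out.println("p95 = " + percentile(samples, 95) + "ms"); // p95 = 8000ms
    }
}
```

This is why percentiles beat averages for spotting trouble: the average of these samples hides the fact that some users waited 8 seconds.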
 
“OK, my apps have all of them. Am I observable now?”
Could be. But having these three things alone doesn't guarantee observability if you don't know how to use them smartly. Imagine working with all that data scattered all over the place, each piece on its own? Come on… we need proper tools, my friend. Tools that integrate all of them into one complete solution. I'm not going to sell you anything here, but there are a lot of options to choose from to finally gain observability, which I'll cover in a bit.
Observability system
Obviously, we can always build that tool ourselves, but I would recommend not going down that road. There is a lot of open source software for this purpose, with features as complete and rich as we can imagine, and if we need enterprise support we can always buy commercially licensed software from world-renowned companies. Typically, our observability system needs to provide these capabilities:
  • Data collection
    • Having all the system outputs collected in one place is what makes observability possible. The system should facilitate the collection and aggregation of, and access to, CPU and memory data, app logs, availability numbers, average latency and other metrics.
  • Monitoring
    • I mentioned that monitoring and observability do not exclude each other. In fact, we cannot have observability without monitoring. Teams must be able to view app and system data with ease, so observability tools set up dashboards to monitor application health, related services and relevant business objectives (often shown as graphs, gauges, charts…).
  • Analysis
    • The main way to understand all the data we've collected is by analyzing it through numbers, dashboards and visualizations, backed by complete traces. I said it: complete traces. You might have heard of APM (application performance monitoring), or even used it in your apps. It is one place to monitor predefined metrics of your app, like periodic health checks, CPU & memory capacity and usage, the number of open DB connections, connection pooling, error rates, etc.
      Nowadays, we are shifting to modern development practices; if you haven't dug into distributed systems just yet, agile development, CI/CD, DevOps, containerization, Kubernetes and cloud platforms will blow your mind. A traditional monitoring system cannot keep pace. This is where APM and observability systems differ (more on that here and here). In short, we need an analytics system able to create and store real-time, high-fidelity, context-rich, fully correlated records of every application, user request and data transaction on the network.
  • AI (AIOps).
    • This is purely optional, but who isn't selling AI nowadays? AI-driven observability enables development teams to proactively protect enterprise IT infrastructure instead of solving problems as they arise. Using ML algorithms, observability tools can parse extensive data streams to find patterns, trends and anomalies, revealing insights that a human might overlook.

Observability Tech

Okay, so now let's talk about how we implement observability in code. I cover only three products that are commonly used, and I intend to deliver just the concepts, the importance, and the awareness of why this matters. These products are just alternatives; there are a lot of observability technologies out there, from OSS to enterprise ones.

Prometheus

It is an open source systems monitoring and alerting toolkit. Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.
Prometheus works well for recording any purely numeric time series. It fits both machine-centric monitoring as well as monitoring of highly dynamic service-oriented architectures. In a world of microservices, its support for multi-dimensional data collection and querying is a particular strength.
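To sketch how Prometheus is pointed at an application, here is a minimal `prometheus.yml`. The job name, target, and metrics path are assumptions for a Spring Boot app exposing a Prometheus endpoint via Actuator; adjust them to your setup:

```yaml
global:
  scrape_interval: 15s   # how often Prometheus pulls metrics

scrape_configs:
  - job_name: "payment-service"          # illustrative job name
    metrics_path: "/actuator/prometheus" # typical Spring Boot Actuator path
    static_configs:
      - targets: ["localhost:8080"]      # assumed app host:port
```

Note the pull model: Prometheus scrapes the app's endpoint on an interval, rather than the app pushing metrics out.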

Zipkin

This is magic. Zipkin is a distributed tracing system. It helps gather the timing data needed to troubleshoot latency problems, especially in service architectures. Zipkin automatically includes two IDs that can be used for tracing: a Trace ID and a Span ID. With these two, we can collect the necessary data about requests. More interestingly, it can summarize them for you, such as the percentage of time spent in a service and whether or not operations failed.
What are Traces & Spans
In my previous blog, Better Support Experience with Meaningful Logging, I gave a tip to add a correlationID or requestID to our logs. With Zipkin, we don't need to add these IDs manually; Zipkin provides a traceID and spanID for our logs automatically. These two IDs are used as follows.
Traces and Spans indicate the journey of the request and each event occurred during the process.
A traceID uniquely identifies an entire request's journey across all microservices, while a spanID identifies one step: an individual unit of work (a single operation or hop) within that trace. Every span in a request shares the same traceID but holds a unique spanID. In addition, a span also contains a parentSpanID, which allows Zipkin to reconstruct the structure of the events within the trace.
With these IDs added by Zipkin, we can later see a comprehensive visualization in Zipkin UI (or any other dashboard you prefer) which will tell us about the status, the duration, the errors of a transaction (or request) in a whole complete story. A good log is like a fairytale, remember? 🙂
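The traceID/spanID/parentSpanID relationship can be sketched in a few lines of plain Java. The names and structure here are illustrative; real instrumentation (e.g. Zipkin's Brave) generates and propagates these IDs for you:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Sketch of the trace/span model: every span in one request shares a
// traceId; each span has its own spanId and a parentSpanId pointing
// back at the span that called it.
public class TraceSketch {
    static class Span {
        final String traceId, spanId, parentSpanId, name;
        Span(String traceId, String spanId, String parentSpanId, String name) {
            this.traceId = traceId; this.spanId = spanId;
            this.parentSpanId = parentSpanId; this.name = name;
        }
    }

    static final List<Span> collected = new ArrayList<>();

    // A root span starts a new trace; child spans inherit the traceId.
    static Span startSpan(Span parent, String name) {
        String traceId = (parent == null) ? newId() : parent.traceId;
        String parentId = (parent == null) ? null : parent.spanId;
        Span span = new Span(traceId, newId(), parentId, name);
        collected.add(span);
        return span;
    }

    static String newId() {
        return UUID.randomUUID().toString().replace("-", "").substring(0, 16);
    }

    public static void main(String[] args) {
        Span checkout = startSpan(null, "checkout");        // user clicks "pay"
        Span api      = startSpan(checkout, "payment-api"); // backend hop
        Span gateway  = startSpan(api, "payment-gateway");  // third-party hop
        for (Span s : collected) {
            System.out.println(s.name + " trace=" + s.traceId
                    + " span=" + s.spanId + " parent=" + s.parentSpanId);
        }
    }
}
```

Because all three spans print the same traceId, a collector can stitch them back into one tree, with parentSpanId giving the call structure.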
You might think: “Ah, my application is monolithic. I don’t need complex tracing system like this.”
Think again. Aren’t we, at least, using one API through HTTP?
Metrics are a class of information that allows you to reason about the performance of a system in the aggregate (across different components of a single app, instances of an app in a cluster, clusters operating in different environments or regions, etc.).
⚠️
Notably this excludes information intended to reason about the contribution of various components to the total latency of a single request as it passes through a series of services; this is the responsibility of distributed tracing collectors like Spring Cloud Sleuth, Zipkin’s Brave, etc.
Distributed tracing systems provide detailed information about subsystem latency, but generally downsample in order to scale (e.g. Spring Cloud Sleuth ships 10% of samples by default). Metrics data is generally pre-aggregated and so naturally lacks correlative information, but is also not downsampled. So, for a series of 100,000 requests in a minute interval that feature an interaction with service A and, depending on the input, maybe an interaction with service B:
  1. Metrics data will tell you that in the aggregate, service A’s observed throughput was 100k requests and service B’s observed throughput was 60k requests. Additionally, in that minute, service A’s max overall average latency was 100ms and service B’s max overall average latency was 50ms. It will also provide information on maximum latencies and other distribution statistics in that period.
  2. A distributed tracing system will tell you that for a particular request (but not the entire population of requests, because remember downsampling is happening), service A took 50 ms and service B took 90ms.
You might reasonably infer from the metrics data that roughly half the time spent in the worst-case user experience was spent in each of A and B, but you can’t be certain since you are looking at an aggregate, and it is entirely possible that in the worst case all 100ms was spent in service A and B was never called at all.
Conversely, from tracing data you cannot reason about throughput over an interval or the worst-case user experience.

Grafana

We have the foundations, but we need visualization. Here comes Grafana. In short, it is an all-in-one analytics tool that integrates the data we've collected from Prometheus and Zipkin. And of course, we can create beautiful charts and informative dashboards that support data-driven decision making.
Grafana Dashboard
There are many use cases we can use Grafana for, such as data visualization (integrating, transforming, etc.), real-time monitoring, and even alerting. It has broad data source support, including Prometheus, Elasticsearch, SQL databases, and even Zipkin!
Also, we are dealing with collected data and logs that are all about time. Guess what? Grafana works natively with time series databases (TSDBs), so what else do we need? 😀 Seriously, if you are looking for an open source analytics tool, I would highly recommend Grafana. Learn more here.
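A Grafana panel over a Prometheus data source is ultimately just a PromQL query. As an example, assuming the app exposes Micrometer's `http_server_requests` metrics (as a Spring Boot app with Actuator typically would), a request-rate panel broken down by HTTP status might use:

```promql
# Requests per second over the last 5 minutes, grouped by HTTP status
sum(rate(http_server_requests_seconds_count[5m])) by (status)
```

Pair a query like this with a latency panel and an error-rate panel and you already have the skeleton of a service health dashboard.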

Let’s put them in action
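A quick way to try all three together is Docker Compose. Here is a minimal sketch: the images (`prom/prometheus`, `openzipkin/zipkin`, `grafana/grafana`) and default ports are the official ones, but treat the file layout and the mounted config path as assumptions to adapt to your project:

```yaml
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"   # Prometheus UI
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml  # your scrape config

  zipkin:
    image: openzipkin/zipkin
    ports:
      - "9411:9411"   # Zipkin UI and span collector endpoint

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"   # Grafana UI
    depends_on:
      - prometheus
      - zipkin
```

After `docker compose up`, point your instrumented app at Zipkin for spans, let Prometheus scrape its metrics endpoint, and add Prometheus and Zipkin as data sources in Grafana.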

 
 

References

How to Monitor Spring Boot Application With Prometheus and Grafana
In this video, we explore how to monitor a Spring Boot application using Prometheus and Grafana: metrics are gathered with Prometheus and visualized on a Grafana dashboard. Community dashboard: https://grafana.com/grafana/dashboards/6756. Companion article: https://refactorfirst.com/spring-boot-prometheus-grafana