What is O11y?
O11y stands for observability: the ability to understand what is happening in our system or application by examining the system’s outputs, like logs, metrics, traces, and other data. It enables not only engineers, but also other IT teams like DevOps, SRE, or even standby system support, to understand the behavior of a system or app, to know what went wrong (if that happens), why it went that way, find the root cause, and act on it. Even though the two are similar, o11y takes monitoring to a higher level for a few reasons.
Focus
Monitoring tracks predefined metrics and alerts on known issues, while o11y focuses on understanding the state of the system by analyzing its outputs. It is often said that monitoring deals with “known unknowns” while observability deals with “unknown unknowns”.
Approach
Monitoring is a “reactive” approach: we act when something goes wrong based on pre-set thresholds. Observability is more of a “proactive” approach: it often incorporates extra situational and historical data to analyze the root cause of why a monitoring alert occurred.
Look at it this way: monitoring is like a car dashboard, showing us basic information about the car, like the fuel level and how fast we are going; a modern car even tells us more, like tire pressure, temperature, GPS, and so on. Observability is the car mechanic. Say that one day while you’re driving you feel something unusual with your car, and it shows up on the dashboard. Oh, you see an alert. That’s monitoring. Then you take your car to the service center to have it checked. You tell the mechanic what happened. The mechanic uses the information you gave him to diagnose what’s happening and how to fix it (by checking your car, of course). Now that is observability.
So we can see that monitoring and observability are not two different things; they complement each other. Monitoring and observability offer businesses complementary approaches to diagnosing system issues. Whereas monitoring tells teams when something is wrong, observability tells them what’s happening, why it’s happening, and how to fix it.
Why it matters
First, user experience. Let me make this short: if you build a system or application and you don’t care about your users, then you probably don’t need observability, and you probably don’t need monitoring at all. It may sound silly, but rationally, if you make something, you want someone else to use it, to feel its usefulness, to make them want to use your product, right? So it definitely doesn’t make sense not to care about users. And that extends to how users feel when using the product and what makes them want to keep using it.
I’m not talking about the business value the product offers, like “Oh, it’s a banking app, so people would want to use it for financial purposes: saving money, investments, credit cards, or applying for loans”. But if that banking app’s performance sucks, with slow fund transfers, unknown errors when making payments at the supermarket, or unreliable balance inquiries, and other banks offer more reliable apps as alternatives, which one do you think people will use?
That is a rhetorical question; the answer is obvious, and it shows why ensuring the user experience of our application is crucial. It is a life-or-death choice for the business itself. As a result, IT teams face mounting pressure to track and respond to issues much faster. This is where observability comes into play.
Second, tech evolves dramatically. As organizations embrace new concepts like distributed systems, cloud, DevOps, and so on, system architectures have dramatically increased in complexity and scale. Simple systems have fewer moving parts, making them easier to manage. Monitoring CPU, memory, databases, and networking conditions is usually enough to understand these systems and apply the appropriate fix to a problem.
Distributed systems have a far higher number of interconnected parts, so the number and types of failures that can occur are higher too. Additionally, distributed systems are constantly updated, and every change can create a new type of failure. In a distributed environment, understanding a current problem is an enormous challenge, largely because it produces more “unknown unknowns” than simpler systems.
How to become observable
3 Pillars of Observability
Observability Tech
The distinction between metrics and tracing
By metrics, we specifically mean the class of information that allows you to reason about the performance of a system in the aggregate (across different components of a single app, instances of an app in a cluster, clusters operating in different environments or regions, etc.).
Notably this excludes information intended to reason about the contribution of various components to the total latency of a single request as it passes through a series of services; this is the responsibility of distributed tracing collectors like Spring Cloud Sleuth, Zipkin’s Brave, etc.
Distributed tracing systems provide detailed information about subsystem latency, but generally downsample in order to scale (e.g. Spring Cloud Sleuth ships 10% of samples by default). Metrics data is generally pre-aggregated and so naturally lacks correlative information, but is also not downsampled. So, for a series of 100,000 requests in a minute interval that feature an interaction with service A and, depending on the input, maybe an interaction with service B:
- Metrics data will tell you that in the aggregate, service A’s observed throughput was 100k requests and service B’s observed throughput was 60k requests. Additionally, in that minute, service A’s max overall average latency was 100ms and service B’s max overall average latency was 50ms. It will also provide information on maximum latencies and other distribution statistics in that period.
- A distributed tracing system will tell you that for a particular request (but not the entire population of requests, because remember downsampling is happening), service A took 50 ms and service B took 90ms.
You might reasonably infer from the metrics data that roughly half the time spent in the worst-case user experience was spent in each of A and B, but you can’t be certain since you are looking at an aggregate, and it is entirely possible that in the worst case all 100ms was spent in service A and B was never called at all.
Conversely, from tracing data you cannot reason about throughput over an interval or the worst-case user experience.
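To make the aggregate-only nature of metrics concrete, here is a minimal sketch using Micrometer’s Timer API (the metrics facade commonly paired with Prometheus and Spring Boot). The metric name, tags, and recorded latencies below are made up for illustration; the point is that the registry keeps only counts and distribution statistics per tag combination, never individual requests:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class MetricsAggregationSketch {

    public static void main(String[] args) {
        // In a real app this would typically be a PrometheusMeterRegistry wired by Spring Boot;
        // SimpleMeterRegistry keeps the sketch self-contained.
        MeterRegistry registry = new SimpleMeterRegistry();

        // One timer per downstream service; "service" is an illustrative tag.
        Timer serviceA = Timer.builder("downstream.call.latency")
                .tag("service", "A")
                .register(registry);
        Timer serviceB = Timer.builder("downstream.call.latency")
                .tag("service", "B")
                .register(registry);

        // Record a handful of observed latencies (stand-ins for real requests).
        serviceA.record(Duration.ofMillis(40));
        serviceA.record(Duration.ofMillis(100));
        serviceB.record(Duration.ofMillis(50));

        // Metrics reason in the aggregate: throughput, max, mean -- no per-request correlation.
        System.out.printf("service A: count=%d max=%.0fms mean=%.0fms%n",
                serviceA.count(),
                serviceA.max(TimeUnit.MILLISECONDS),
                serviceA.mean(TimeUnit.MILLISECONDS));
        System.out.printf("service B: count=%d max=%.0fms mean=%.0fms%n",
                serviceB.count(),
                serviceB.max(TimeUnit.MILLISECONDS),
                serviceB.mean(TimeUnit.MILLISECONDS));
    }
}
```

Swap SimpleMeterRegistry for a Prometheus-backed registry (or let Spring Boot auto-configure one) and the same timers become scrapeable time series; the per-request detail, by contrast, is what a tracing system like Zipkin would give you.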
Understanding Prometheus Metrics (label & attr)
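As a rough illustration of labels (the metric and label names below are made up, and this assumes the micrometer-registry-prometheus dependency is on the classpath): every distinct combination of label values on the same metric name becomes its own time series in Prometheus.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public class PrometheusLabelSketch {

    public static void main(String[] args) {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Micrometer "tags" become Prometheus "labels" on the exposed metric.
        Counter ok = Counter.builder("http_requests")
                .tag("method", "GET")
                .tag("status", "200")
                .register(registry);
        Counter error = Counter.builder("http_requests")
                .tag("method", "GET")
                .tag("status", "500")
                .register(registry);

        ok.increment(97);
        error.increment(3);

        // Prints the Prometheus text exposition format, along the lines of:
        // http_requests_total{method="GET",status="200"} 97.0
        // http_requests_total{method="GET",status="500"} 3.0
        System.out.println(registry.scrape());
    }
}
```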
Email Alert Scenario
Zipkin
What is Trace & Spans
Not useful in a monolithic application? THINK AGAIN! Aren’t we all calling APIs over HTTP at the very least? Then it is important to also have monitoring for those calls, as sketched below.
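For example, here is a minimal, hypothetical sketch using Zipkin’s Brave tracer inside a single (monolithic) service. The service and span names are made up, and no reporter is configured, so nothing is actually sent to a Zipkin server; it only shows how a trace is a tree of spans, with a child span wrapping the outbound HTTP call:

```java
import brave.Span;
import brave.Tracer;
import brave.Tracing;

public class MonolithTracingSketch {

    public static void main(String[] args) {
        // No reporter configured here, so spans are not shipped anywhere;
        // in a real setup you would plug in a Zipkin sender or let Spring Cloud Sleuth do it.
        Tracing tracing = Tracing.newBuilder()
                .localServiceName("monolith-app")
                .build();
        Tracer tracer = tracing.tracer();

        // The trace: one root span for the incoming request...
        Span request = tracer.newTrace().name("GET /checkout").start();
        try (Tracer.SpanInScope ws = tracer.withSpanInScope(request)) {

            // ...and a child span for the outbound HTTP call the monolith still makes.
            Span paymentCall = tracer.nextSpan().name("POST payment-gateway/charge").start();
            try {
                paymentCall.tag("http.method", "POST");
                // callPaymentGateway();  // hypothetical outbound HTTP call
            } finally {
                paymentCall.finish();
            }
        } finally {
            request.finish();
        }

        tracing.close();
    }
}
```

In a Spring application you would rarely write this by hand: Spring Cloud Sleuth creates and propagates spans around incoming and outgoing HTTP calls automatically.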
Support scenario IRL
How can we utilize these technologies to solve production problems?