Observability is not optional anymore

For most of my career, logging was the thing you added when something went wrong. Logs were an afterthought, meant for debugging, not design. Metrics meant a financial chart somewhere, in a separate module, never part of the code you were actually writing. Tracing was unheard of.

That era is over.

Developers are coding less and AI is coding more: not just converting user stories into code blocks, but taking over design thinking and producing functional outputs directly. At the same time, our architectures are getting more distributed, more event-driven, and more autonomous. As a result, observability has quietly become one of the most critical engineering disciplines of our time, not just operationally but strategically.

“You cannot improve what you cannot see. You cannot explain what you cannot understand. And in AI-augmented systems, you cannot trust what you cannot explain. Seeing is Observing.”

The old model of monitoring is broken

Traditional monitoring worked on a simple assumption: you knew in advance what could go wrong and what you wanted to monitor. So you set thresholds. High memory? Slow response? Database down? Send an alert.
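As a rough illustration of that threshold model, here is a minimal Python sketch. The metric names and limits are hypothetical, chosen only to show the shape of the approach: every alert has to be anticipated in advance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float


# Hypothetical limits, chosen only for illustration.
THRESHOLDS = [
    Threshold("memory_percent", 90.0),
    Threshold("response_ms", 500.0),
]


def check_thresholds(sample: dict[str, float]) -> list[str]:
    """Return one alert message per metric that exceeds its predefined limit."""
    alerts = []
    for t in THRESHOLDS:
        value = sample.get(t.metric)
        if value is not None and value > t.limit:
            alerts.append(f"ALERT {t.metric}={value} exceeds {t.limit}")
    return alerts


# A sample that crosses a known limit raises an alert.
print(check_thresholds({"memory_percent": 95.0, "response_ms": 120.0}))

# A failure nobody anticipated (wrong answers, returned quickly) raises nothing,
# because no threshold was ever defined for it.
print(check_thresholds({"memory_percent": 40.0, "response_ms": 120.0, "wrong_answers": 12.0}))
```

The second call is the interesting one: the system is misbehaving, but nothing in the predefined list knows how to notice.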

That assumption no longer holds. When you have microservices talking across async queues, Service Bus sessions with thousands of messages, twenty SOAP calls to external systems, third-party integrations throwing unknown errors, and now AI agents somewhere in that chain, your known failure space is a fraction of what can actually go wrong.

This is precisely why the three pillars of observability (logs, metrics, and traces) matter more now than ever: not as separate tools bolted together, but as a unified, correlated signal that tells you what your system actually did, not just whether it crossed a threshold. Dashboards are what make all of it visible, configurable, and actionable.
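One common way to get a correlated signal is to stamp every log line with the ID of the request that produced it, so logs can later be joined to the matching trace. The sketch below uses only the Python standard library. The logger name `orders`, the `current_trace_id` variable, and the JSON field names are assumptions for illustration, not a prescribed schema.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the ID of the request currently being handled (hypothetical name).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so a log backend can index trace_id."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("orders")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger) -> str:
    """Set a fresh trace ID for the duration of one request and log under it."""
    trace_id = uuid.uuid4().hex
    token = current_trace_id.set(trace_id)
    try:
        logger.info("request received")
        logger.info("calling downstream service")
    finally:
        current_trace_id.reset(token)
    return trace_id


buffer = io.StringIO()
trace_id = handle_request(build_logger(buffer))
print(buffer.getvalue())
```

Every line written during the request carries the same `trace_id`, which is the key a dashboard needs to jump from a log line to the trace it belongs to.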

What AI adds to the complexity equation

When you introduce AI into production systems — whether it’s just a coding assistant generating user stories, unit tests, or algorithms, or something far more critical like an LLM powering a clinical decision assistant, an ML model predicting crew disruptions, or an autonomous agent handling patient intake — you inherit a new class of operational problems:

Non-determinism: AI doesn’t always give the same answer to the same question. Ask it twice and you may get two different outputs. You need to watch trends, not individual responses.
Latency that compounds: One AI call takes 2 seconds. Chain three of them sequentially in a workflow and your user is waiting 6 seconds. Chain five and you are likely past a timeout. You need to see the full journey: what called what, in what order, and how long each step took.
Silent failures: How often have you seen an AI agent say ‘I don’t know’? AI almost always sounds confident, and the problem is that it sounds just as confident when the answer is wrong. A crashed system is easy to detect. An AI system that quietly gives wrong answers is not. No error. No alert. Just bad outcomes downstream: a wrong drug dosage suggestion, a wrong crew assignment, a wrong patient record match. You won’t know until something breaks at the business level.
Cost as a system signal: Every AI call costs money, token by token, call by call. If your workflow calls an LLM 50 times when it should call it 5 times, your bill explodes, and your system is probably slow too. Cost is now a health metric, not just a finance problem. You need to observe it.
Risk of looking like an idiot: Have you ever been in a meeting where your tech lead asks what exactly happened in that workflow? What did you code? Which function did you change? Without observability, you don’t have an answer. You were flying blind, and now everyone knows it.
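To make the "cost as a system signal" point concrete, here is a minimal sketch of per-workflow usage tracking. The price per 1,000 tokens, the workflow name, and the expected call count are all hypothetical. Real pricing varies by provider and model.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical price, for illustration only.
PRICE_PER_1K_TOKENS_USD = 0.01


@dataclass
class LlmUsageTracker:
    """Accumulate call counts and token counts per workflow."""

    calls: dict = field(default_factory=lambda: defaultdict(int))
    tokens: dict = field(default_factory=lambda: defaultdict(int))

    def record(self, workflow: str, prompt_tokens: int, completion_tokens: int) -> None:
        self.calls[workflow] += 1
        self.tokens[workflow] += prompt_tokens + completion_tokens

    def cost_usd(self, workflow: str) -> float:
        return self.tokens[workflow] / 1000 * PRICE_PER_1K_TOKENS_USD

    def over_budget(self, workflow: str, max_calls: int) -> bool:
        """Flag a workflow that calls the model more often than expected."""
        return self.calls[workflow] > max_calls


tracker = LlmUsageTracker()

# The runaway case from the list above: 50 calls where 5 were expected.
for _ in range(50):
    tracker.record("patient_intake", prompt_tokens=800, completion_tokens=200)

print(tracker.calls["patient_intake"], tracker.cost_usd("patient_intake"))
print(tracker.over_budget("patient_intake", max_calls=5))
```

In a real system these counters would be exported as metrics rather than printed, so that call count and cost sit on the same dashboard as latency and error rate.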

None of this is hypothetical. I’ve seen workflows where an AI component worked correctly in isolation but failed once it was integrated. It failed silently. No alerts. No errors. Just wrong outcomes. The only way to catch it was distributed tracing that correlated the full request: every API call and every functional component’s outcome.
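The idea behind a trace can be sketched in a few lines: record each step as a span with a name, a parent, and a duration, then read the whole journey back. The code below is a standard-library stand-in for illustration, not the API of any tracing library. `Recorder`, `Span`, and `FakeClock` are hypothetical names, and the fake clock replaces real 2-second model calls so the example runs instantly.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]
    start: float
    end: float = 0.0

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class Recorder:
    """Collect spans for a single request, in the order they finish."""

    clock: Callable[[], float] = time.perf_counter
    spans: list = field(default_factory=list)
    _stack: list = field(default_factory=list)

    @contextmanager
    def span(self, name: str):
        parent = self._stack[-1] if self._stack else None
        current = Span(name=name, parent=parent, start=self.clock())
        self._stack.append(name)
        try:
            yield current
        finally:
            self._stack.pop()
            current.end = self.clock()
            self.spans.append(current)


class FakeClock:
    """Deterministic clock so the example does not need to sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
recorder = Recorder(clock=clock)

with recorder.span("workflow"):
    for step in ("classify", "retrieve", "summarize"):
        with recorder.span(step):
            clock.advance(2.0)  # stands in for a 2-second model call

for s in recorder.spans:
    print(f"{s.name:<10} parent={s.parent} duration={s.duration:.1f}s")
```

Three chained 2-second steps show up as three child spans under one 6-second `workflow` span, which is exactly the "what called what, in what order, and how long" view that threshold alerts cannot give you.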

OpenTelemetry is becoming the lingua franca

If you’re not already familiar with OpenTelemetry, it’s worth understanding what it actually solves. OpenTelemetry standardizes the instrumentation layer. You instrument once — your NestJS services or any stack you use — and you can ship that telemetry to any backend: Grafana, Datadog, Jaeger, whatever you choose.

Pair this with Grafana as the dashboard layer, plus Loki for logs, Tempo for traces, and Prometheus for metrics, and you have a complete, open-source observability platform that scales from a startup’s Kubernetes cluster to enterprise production. The tooling is mature, the community is large, and the cost model is often more predictable than proprietary alternatives.

I saw a role called ‘Observability Engineering Manager’ some time back. I thought it was something like ‘Vibe Coding Cleanup Specialist’ 😄 Nah. It’s real.

Here’s the shift I want to advocate for: stop treating observability as a separate discipline. Start treating it as a product engineering discipline — an integral part of Full Stack. Don’t be surprised when observability shows up as a constant requirement in Full Stack job listings.

Observability answers more than infrastructure questions. It answers design questions, functional questions, workflow questions, and architectural questions. Teams that embed it early, while writing the code and not after the incident, are the ones building systems they can actually trust.

The career angle (yes, I’m going there)

As someone who has worked across multiple domains like aviation systems, healthcare and enterprise integrations, I’ve noticed a consistent gap: most senior engineers have strong opinions about system design, but very few have deep, practical observability experience. I include myself in that.

That gap is widening as AI adds new observability requirements. And in a world where organizations are deploying AI systems into production and hoping for the best, “I can tell you exactly what’s happening and why” is a powerful position to be in. As of today, it sounds fancy too. An added bonus.

If you’re thinking about upskilling, the OTel + Grafana stack is a practical, vendor-neutral starting point. The fastest path is to instrument something real, set up a real dashboard, and chase a real problem.

The bottom line

In the age of AI, a system you can’t observe is a system you can’t trust. And a system you can’t trust is a liability, not a product. Observability is how you close that gap — methodically, systematically, and early.

What’s your experience with observability in production AI systems? Would love to hear what’s working — and what isn’t.

#Observability #OpenTelemetry #AIEngineering #Grafana #SoftwareArchitecture #FullStack

