Observability Platforms: Overkill or Essential?


A salesperson from an observability vendor called me last month. Their platform could correlate traces across microservices, automatically detect anomalies with machine learning, and provide “full-stack visibility into distributed systems.”

The price tag? $180K annually for our environment. Plus implementation costs.

I asked what problem this solved that our current monitoring didn’t. The answer was mostly marketing speak about “modern architectures” and “cloud-native complexity.”

What Observability Actually Means

Observability became trendy around the same time everyone started breaking monoliths into microservices. Suddenly, a single user request might touch fifteen different services. Traditional monitoring—checking if servers are up, graphing response times—wasn’t enough.

You needed to trace requests across services. Understand which downstream call was slow. Correlate metrics with logs with traces. See the whole picture.

That’s genuinely useful. The question is whether you need an expensive platform to get there, or whether you can build something simpler.

The Three Pillars Trap

Observability vendors love talking about the “three pillars”: logs, metrics, and traces. You need all three, they say, and only their platform brings them together properly.

Here’s what they don’t tell you: most problems can be debugged with just logs and metrics. Distributed tracing is useful, but it’s not essential for every system.

If you’re running a monolith or a small number of services, you probably don’t need distributed tracing at all. You definitely don’t need machine learning to detect anomalies—your engineers will notice when things break.

What You Probably Actually Need

Start with structured logging. Not just printf debugging scattered through the code—actual structured logs with consistent fields. Request IDs. User IDs. Timestamps. Error details.

Push these to a central location. Elasticsearch works. CloudWatch Logs works. Even just grepping log files works if you’re not at massive scale.
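To make "structured" concrete, here’s a minimal sketch using only Python’s standard library. The field names (`request_id`, `user_id`) and the logger name are illustrative, not a prescribed schema — the point is that every log line comes out as one machine-searchable JSON object instead of free-form text.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Extra fields attached via the `extra=` argument below.
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
        }
        return json.dumps(entry)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment failed", extra={"request_id": "req-123", "user_id": "u-42"})
```

Once logs look like this, shipping them to Elasticsearch or CloudWatch is just a matter of pointing an agent at the output.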

Then add metrics. Response times. Error rates. Resource usage. Graph them. Set up alerts when they cross thresholds you care about. Prometheus is great for this. CloudWatch metrics work fine too.
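A threshold alert is simple enough that you could write one yourself — which is the point. The sketch below tracks an error rate over a sliding window and flags when it crosses a fixed threshold; it’s the kind of rule you’d normally express in Prometheus or CloudWatch rather than hand-roll, and the window and threshold values are illustrative.

```python
import time
from collections import deque


class ErrorRateMonitor:
    """Track request outcomes over a sliding time window and flag
    when the error rate crosses a fixed threshold."""

    def __init__(self, window_seconds=60, threshold=0.05):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, was_error) pairs

    def record(self, was_error, now=None):
        now = time.time() if now is None else now
        self.events.append((now, was_error))
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()

    def should_alert(self):
        if not self.events:
            return False
        errors = sum(1 for _, was_error in self.events if was_error)
        return errors / len(self.events) > self.threshold
```

Notice there’s no machine learning here: you decide what error rate you care about, and the alert fires when it’s exceeded.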

That handles 80% of debugging scenarios. Some service is slow? Check the metrics to see which one. What happened? Check the logs for that time period. Filter by request ID if you need to trace a specific request.

When Tracing Makes Sense

Distributed tracing starts making sense when you’ve got complex service interactions. Ten or more services talking to each other in non-obvious ways. Requests that fan out to multiple backends before consolidating results.

In those environments, tracing is valuable. You can see that the checkout process is slow because the recommendation service is taking 2 seconds, which is slow because it’s waiting on the inventory service, which is slow because… you get the idea.

But you don’t necessarily need a vendor platform for this. OpenTelemetry has become the standard. Jaeger is open source and works well. If you’re on AWS, X-Ray is built in and costs a fraction of third-party platforms.
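To demystify what these tools actually propagate: every outgoing request carries a `traceparent` header in the W3C Trace Context format (the format OpenTelemetry uses), so each service can stitch its spans into the same trace. A bare-bones sketch of that mechanic, without any SDK:

```python
import secrets


def new_traceparent():
    """Start a new trace: version, 16-byte trace ID, 8-byte span ID,
    and a 'sampled' flag, per the W3C Trace Context header format."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent):
    """Continue an incoming trace with a fresh span ID, keeping the
    trace ID so the backend (Jaeger, X-Ray, ...) can join the spans."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

In practice you’d let OpenTelemetry’s instrumentation libraries inject and extract this header for you — but it’s worth knowing there’s no magic underneath, just an ID threaded through HTTP headers.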

The Auto-Detection Myth

Every observability vendor promises automatic anomaly detection. Machine learning will spot problems before you do! You’ll know about issues before customers complain!

In practice, this generates a lot of alerts about things that don’t matter. Traffic patterns changed because it’s Monday morning and usage is different on Mondays. Response times increased because someone ran a big batch job. Memory usage spiked because of that perfectly normal nightly process.

You can tune these systems. But tuning them well takes as much time as just setting up straightforward threshold alerts on metrics you actually care about.

The Real Cost

Observability platforms aren’t just expensive to buy—they’re expensive to operate.

You need to instrument your code. That’s development time. You need to train your team on the platform. That’s time and often additional licensing for training materials. You need to tune the settings so you’re not drowning in alerts or missing important ones. That’s ongoing operational overhead.

And you need to manage the data volume. Observability platforms typically charge based on data ingestion. That microservice that logs every database query? That’s going to get expensive fast.
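A back-of-the-envelope estimate shows why. All the numbers below are illustrative assumptions, not vendor pricing — plug in your own traffic:

```python
# Rough ingestion estimate for a service that logs every database query.
queries_per_second = 500       # assumed steady-state query rate
bytes_per_log_line = 300       # assumed size of one structured log entry
seconds_per_day = 86_400

gb_per_day = queries_per_second * bytes_per_log_line * seconds_per_day / 1e9
gb_per_month = gb_per_day * 30

print(f"{gb_per_day:.1f} GB/day, {gb_per_month:.0f} GB/month")
```

At hundreds of gigabytes a month, multiplied by a per-GB ingestion rate, one chatty service can dominate your observability bill.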

I’ve seen organisations spend more on observability than on the actual infrastructure being observed. That’s not inherently wrong—visibility is valuable—but you should at least be aware of the trade-off.

Build vs Buy

There’s a middle path between buying an enterprise platform and cobbling together open source tools with bash scripts.

Use managed versions of open source tools. Elastic Cloud hosts Elasticsearch for you. Grafana Cloud gives you hosted Prometheus and Grafana. These cost way less than Datadog or New Relic, and you get most of the functionality.

Or use what your cloud provider offers. CloudWatch is fine for many use cases. Azure Monitor works. Google Cloud Logging and Monitoring are solid. They’re not as feature-rich as dedicated platforms, but they’re cheap and they’re already integrated.

When Vendors Make Sense

I’m not saying never buy an observability platform. I’m saying be clear about what problem you’re solving.

If you’re a mid-sized engineering team managing complex distributed systems, a vendor platform might be worth it. The productivity gains from better debugging tools can justify the cost.

If you’re a startup with five engineers and a Django app, you don’t need Datadog’s full suite. CloudWatch logs and metrics will get you pretty far.

If you’re an enterprise with compliance requirements about log retention and audit trails, a vendor platform might simplify life. Or it might just add another vendor to manage.

The Features You’ll Actually Use

Here’s a test: ask your team what observability features they use daily.

The answer is probably: looking at graphs of key metrics, searching logs for errors, and maybe checking which service is slow when things break.

The fancy features—automatic root cause analysis, AI-powered insights, predictive alerting—sound great in demos. In practice, engineers usually know what’s wrong before the AI figures it out.

If you’re paying for features nobody uses, you’re probably overpaying.

Starting Small

If you don’t have observability tooling yet, start simple:

  1. Get your logs centralised and searchable
  2. Add metrics for critical paths (request rates, error rates, latency)
  3. Set up basic alerts on things that matter
  4. Add tracing only if debugging actually requires it

You can always add more later. It’s harder to scale back from a complex setup you’re not getting value from.

The Real Question

The question isn’t “do we need observability?” The question is “what’s the minimum observability we need to debug production issues effectively?”

Sometimes that’s a full vendor platform with all the bells and whistles. Usually it’s something simpler.

Start with simple. Add complexity only when the pain of not having it is clear. That way you’re solving real problems instead of imagined ones.

And if a salesperson tells you their platform is essential for modern applications, ask them what specific problems it solves that simpler tools don’t. If the answer is vague, so should your interest be.