Observability Platforms: Datadog vs Splunk vs Open Source


Our observability costs tripled over eighteen months. Not because we tripled our infrastructure. Because we weren’t paying attention to how monitoring costs scale, and by the time we noticed, we were paying six figures annually for metrics, logs, and traces.

That forced a comprehensive evaluation of observability platforms. We looked at staying with our current vendor, switching to competitors, and building our own using open-source tools. The process taught me a lot about how these platforms actually work and what drives their economics.

How We Got Here

We started with Datadog five years ago when we were much smaller. It was easy to set up, the UI was good, and the pricing seemed reasonable. A few hundred dollars a month for monitoring wasn’t worth optimizing.

Then we grew. More services. More infrastructure. More logs. Datadog's pricing scales with hosts, data volume, and custom metrics, so as our infrastructure grew, costs grew with it. By the time we noticed, we were spending $9K monthly and the trajectory pointed toward $15K within a year.

The breaking point came when we enabled APM (application performance monitoring) and log indexing for a major application. Datadog's bill jumped $3K in one month. We hadn't understood that indexed logs are priced orders of magnitude higher than metrics.

The Evaluation Process

We looked at three directions: optimize our current Datadog usage, switch to a competitor like Splunk or Elastic, or build our own observability stack using open-source tools.

For competitive evaluation, we considered Splunk, New Relic, Dynatrace, Elastic, and Grafana Cloud. For open-source, we looked at Prometheus for metrics, Loki for logs, Tempo for traces, and Grafana for visualization.

The evaluation criteria were cost, features, operational overhead, and data retention. We needed metrics, logs, and distributed tracing. We needed integration with AWS, Kubernetes, and about 30 different application frameworks. We needed reasonable query performance and at least 30 days of data retention.

The Datadog Optimization Path

Before switching platforms, we tried to optimize Datadog usage. Turns out there’s a lot you can do if you actually understand their pricing model.

Log management was the biggest cost driver. We were indexing everything, which is expensive. Datadog charges separately for ingestion and indexing, and indexing is where the money goes: you can ingest logs cheaply, but the moment you index them for search, costs jump dramatically.

We moved to a pattern where only error-level logs get indexed automatically. Everything else gets ingested for archival but not indexed unless we explicitly need to search it. This cut log costs by about 70% while maintaining the ability to investigate issues.
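
In Datadog itself this is configured with index exclusion filters rather than application code, but the decision rule is simple enough to sketch. Here's a minimal Python illustration, with a hypothetical route_log hook standing in for the pipeline configuration:

```python
import logging

# Hypothetical shipper hook: every record is ingested for archival, but only
# error-level and above is flagged for indexing (i.e. made searchable).
INDEX_THRESHOLD = logging.ERROR

def route_log(record: logging.LogRecord) -> dict:
    """Return routing metadata for a single log record."""
    return {
        "archive": True,                              # always keep a copy for later rehydration
        "index": record.levelno >= INDEX_THRESHOLD,   # searchable only when it's an error or worse
    }

# An INFO line is archived but not indexed; an ERROR line is both.
info = logging.LogRecord("api", logging.INFO, __file__, 1, "cache miss", None, None)
error = logging.LogRecord("api", logging.ERROR, __file__, 2, "payment failed", None, None)
print(route_log(info))   # {'archive': True, 'index': False}
print(route_log(error))  # {'archive': True, 'index': True}
```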

Metric optimization was less dramatic but still meaningful. We were collecting metrics at 10-second resolution for everything. Most metrics don’t need that granularity. Changing default collection to 60 seconds for most services reduced metric volume by about 80% with no practical impact on observability.
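
The arithmetic behind that number is just the ratio of collection intervals; we didn't change every service, which is why the realized savings came in around 80% rather than the theoretical 83%:

```python
# Data points per metric per day at each collection interval.
seconds_per_day = 86_400
points_at_10s = seconds_per_day // 10   # 8,640 points per metric per day
points_at_60s = seconds_per_day // 60   # 1,440 points per metric per day

reduction = 1 - points_at_60s / points_at_10s
print(f"{reduction:.0%} fewer data points per metric")  # 83%
```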

APM spans were another cost driver. Distributed tracing generates enormous data volume. We changed from tracing 100% of requests to trace sampling at 5% for normal operations and 100% for errors. This maintained debugging capability while reducing trace costs significantly.
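
In practice this is expressed through the tracer's sampling configuration rather than hand-written code, but the rule we wanted boils down to a few lines. A sketch, with illustrative names:

```python
import random

NORMAL_SAMPLE_RATE = 0.05  # keep 5% of ordinary requests

def keep_trace(is_error: bool, sample_rate: float = NORMAL_SAMPLE_RATE) -> bool:
    """Head-sampling decision: keep every errored request, sample the rest."""
    if is_error:
        return True
    return random.random() < sample_rate
```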

After optimization, we cut our Datadog bill from $9K to about $4.5K monthly. That’s still expensive, but more reasonable.

The Splunk Comparison

Splunk’s model is different. They price primarily on log volume ingested per day, not indexed volume. This is simpler in some ways but means you pay up front for all logs regardless of whether you search them.

For our volume, Splunk would have been roughly comparable to optimized Datadog costs. Slightly cheaper for logs, slightly more expensive for metrics and APM. Close enough that pricing wasn’t a deciding factor.

The Splunk advantage is that if you already use Splunk for other purposes (security monitoring, business analytics), consolidating onto one platform has operational benefits. We didn’t have existing Splunk investment, so this didn’t apply.

The Splunk disadvantage is that the product feels dated. The UI is functional but clunky compared to modern tools. The query language is powerful but has a steep learning curve. For a team used to Datadog’s interface, this felt like a step backward.

The Open Source Option

Building our own observability stack using Prometheus, Loki, Tempo, and Grafana was the most interesting option. The software is free. We’d only pay for infrastructure to run it.

We did a proof of concept. Set up Prometheus for metrics, Loki for logs, Tempo for traces, all visualized through Grafana. For a week, we ran it in parallel with Datadog to compare.

The good: it worked. Query performance was acceptable. The Grafana dashboards were flexible and actually better than Datadog’s in some ways. Infrastructure costs were maybe $500/month for the underlying EC2 instances and storage.
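
The query-performance check during the parallel run was as unglamorous as timing PromQL queries against the Prometheus HTTP API. A rough sketch (hostname and query are placeholders):

```python
import time
import requests

PROM_URL = "http://prometheus.internal:9090"   # placeholder for the PoC server
QUERY = 'sum(rate(http_requests_total{job="api"}[5m])) by (status)'

start = time.monotonic()
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
elapsed = time.monotonic() - start

resp.raise_for_status()
print(f"status={resp.json()['status']}, latency={elapsed:.2f}s")
```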

The bad: operational overhead. These are separate systems that need to be deployed, configured, upgraded, and monitored themselves. We needed to build integration with our existing service discovery and deployment systems. We needed to figure out long-term storage strategies for metrics and logs.

The team estimated two engineers spending about 20% of their time, on an ongoing basis, maintaining the platform. That's roughly $40K annually in labor, which makes the $4.5K/month Datadog bill look more reasonable once you account for it.
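
A back-of-the-envelope annual comparison using the figures above:

```python
datadog_annual = 4_500 * 12     # optimized Datadog: $54,000/year
oss_annual = 500 * 12 + 40_000  # infrastructure plus ~0.4 FTE of labor: $46,000/year
print(f"Datadog ${datadog_annual:,} vs self-hosted ${oss_annual:,}")
```

The gap works out to roughly $8K a year, narrow enough that the decision ended up turning on risk and ownership rather than price.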

What We Decided

We stayed with Datadog, but with much more aggressive cost management. The optimizations we implemented brought costs down enough that the managed service was worth it compared to the operational overhead of the open-source alternative.

The key insight was that observability is infrastructure we rely on constantly. When something breaks, we need monitoring to be working. Having a vendor responsible for uptime and support has value. The open-source approach meant we’d be responsible for keeping our own monitoring systems running, which felt risky.

But we changed how we use Datadog. We treat it as an expensive resource and optimize accordingly. Log indexing is reserved for important logs. Metric resolution is tuned appropriately. APM tracing uses sampling. We review costs monthly and adjust collection policies when costs creep up.

Lessons About Observability Pricing

All observability vendors have complex pricing models based on volume of data. The unit economics favor them because costs scale with infrastructure growth, which means as your business succeeds, monitoring costs automatically increase.

Understanding the pricing model for whatever platform you use is critical. The difference between $9K/month and $4.5K/month was just configuration changes based on understanding what we actually needed versus what we were collecting by default.

Log management is almost always the biggest cost driver if you index everything. Most logs don’t need to be searchable. They just need to exist in case you need them later. Structuring your log management around that principle saves huge amounts of money.

Open-source observability is viable, but only if you have the engineering capacity to operate it. For small teams or organizations without existing expertise in these tools, managed services are usually worth the premium.

The Vendor Lock-in Question

One concern with managed observability platforms is vendor lock-in. Once you’ve built dashboards, alerts, and workflows around Datadog or Splunk, switching is painful.

This is real but probably overstated. The underlying data formats are mostly standard. Metrics are metrics. Logs are logs. Traces follow OpenTelemetry standards. You can export data if you need to.

The lock-in is more about operational knowledge and workflow. Your team learns one platform’s query language, dashboard syntax, and alert configuration. Switching means relearning all of that and rebuilding operational muscle memory. That’s friction, but it’s not insurmountable.

We treat observability as a long-term platform choice but not a permanent one. If costs become unreasonable or a significantly better option emerges, we’d switch. But we’re not actively maintaining flexibility just for its own sake.

The Right Answer for Your Organization

For small organizations with limited infrastructure, start with a managed service like Datadog, New Relic, or Grafana Cloud. The operational simplicity is worth paying for. Just understand the pricing model and don’t default to collecting everything at maximum resolution.

For large organizations with dedicated platform teams, open-source observability makes sense. You have the expertise to operate it and the scale where cost savings matter. Companies like Uber and Netflix run their own observability infrastructure for good reasons.

For mid-sized organizations like ours, it’s a judgment call. Managed services are expensive but reduce operational burden. Open-source is cheaper but requires engineering investment. The right answer depends on your specific costs, team capabilities, and tolerance for operational complexity.

What definitely doesn’t work is ignoring observability costs until they become painful. That’s how you end up with six-figure annual bills that could have been half that with better configuration. Whether you choose managed services or open-source, you need to understand the cost drivers and optimize accordingly.

We learned this the expensive way. Hopefully you won’t have to.