Key Challenges & Context: Observability Spend Outpaced Value
By Q2 2025, the Datadog bill of our client, a cloud-based SaaS communications provider, had jumped 30% in one quarter to $360,000. Spend was outpacing the value teams got from monitoring: no one owned retention, tagging, or custom metrics; Procurement and FinOps lacked predictability; and Engineering needed clear signals without extra manual work.
They ran into the same issues across services:
- Kept everything. Teams kept info logs for 15 days when daily work needed far less.
- Multiplied custom metrics. Teams created duplicate or rarely queried series.
- Let tagging drift. Inconsistent tags blocked cost attribution and cross-team queries.
- Worked from different playbooks. Each team ran Datadog its own way, so fixes did not scale.
- Drowned in alerts. Noisy rules buried real signals and slowed triage.
They tried quick fixes: teams lowered some log levels, trimmed dashboards, and deleted a few metrics. Without shared rules or ownership, changes faded and costs kept rising.
The impact hit the business:
- Monthly charges shifted, which made planning hard for Procurement and FinOps.
- Engineers spent hours cleaning data instead of shipping features.
- Blind cuts threatened monitoring coverage during incidents.
Our client needed a simple, shared way to control Datadog costs while keeping the coverage that protects operations.
Approach: A Six-Week Datadog Cost Audit That Kept Signal and Cut Waste
We built a six-week, low-risk audit that brought Engineering, SRE, and FinOps to the same table. Our client chose us for a clear promise: quantify savings, keep coverage, and leave a simple governance model that teams can run.
Strategy at a glance
- Start with facts, not guesses: measure where costs accrue by data type, service, and environment.
- Fix the foundations first: tagging standards, ownership, and retention rules.
- Pilot before scale: prove no loss of coverage on a non-critical path, then expand.
- Make changes stick: assign owners, create a light review cadence, and document how to maintain gains.
Step 1: Discovery
We mapped Datadog usage across logs, metrics, APM, RUM, and synthetics. We isolated top cost drivers, high-cardinality sources, and “keep everything” retention habits. We documented the dashboards, alerts, and incident workflows that depended on the data, so savings would not break day-to-day operations.
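In practice, this mapping can start from a plain usage export. A minimal sketch, assuming a hypothetical CSV with `product`, `service`, and `cost_usd` columns (these column names are assumptions for illustration, not Datadog’s actual export schema):

```python
import csv
from collections import defaultdict
from io import StringIO

def top_cost_drivers(usage_csv: str, n: int = 3):
    """Aggregate cost per (product, service) pair and return the n largest."""
    totals = defaultdict(float)
    for row in csv.DictReader(StringIO(usage_csv)):
        totals[(row["product"], row["service"])] += float(row["cost_usd"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Illustrative export rows, not real client data.
sample = """product,service,cost_usd
logs,checkout,4200.50
logs,checkout,3100.00
custom_metrics,billing,5600.75
apm,search,1200.00
"""
print(top_cost_drivers(sample, n=2))
```

Ranking by (product, service) pairs rather than by product alone is what surfaces the handful of services driving most of the bill.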
Step 2: Assessment
We broke down spend by team, service, and environment. We quantified “what-if” scenarios for retention, rehydration, and metric cleanup, including the run-rate impact per action. We highlighted where tagging gaps blocked cost attribution and made cross-team queries harder than they should be.
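A retention “what-if” reduces to simple run-rate arithmetic: average indexed volume is ingest rate times retention window, so the saving scales with the days removed. A sketch with illustrative numbers (the volumes and per-GB-day rate are hypothetical, not the client’s actual pricing):

```python
def retention_savings_per_month(gb_ingested_per_day: float,
                                current_days: int,
                                proposed_days: int,
                                cost_per_gb_day: float) -> float:
    """Monthly run-rate saving from shortening indexed log retention."""
    days_removed = current_days - proposed_days
    return gb_ingested_per_day * days_removed * cost_per_gb_day * 30

# Hypothetical: 500 GB/day of info logs, 15 -> 7 day retention,
# at an assumed $0.03 per GB-day of indexed storage.
print(retention_savings_per_month(500, 15, 7, 0.03))  # → 3600.0
```

Running one such function per proposed action is enough to rank the backlog by savings before anyone touches production config.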
Step 3: Recommendations
We delivered a ranked backlog with savings per item, effort, risk, and owner. The plan focused on:
- Retention tuning: reduce info log retention from 15 to 7 days where daily work required less, keep error logs longer, and introduce tiers by environment.
- Rehydration rules: move colder data to cheaper storage and restore only what audits or incidents actually need.
- Custom metric cleanup: remove unused and duplicate series, and address high-cardinality patterns. Align remaining series to SLIs and SLOs.
- Tagging standards: require service, team, environment, and ownership tags to improve queries, dashboards, and cost allocation.
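Tagging standards like these are straightforward to enforce with a small check in CI. A minimal sketch (the required keys mirror the standard above; the resource-list shape is an assumption for illustration):

```python
REQUIRED_TAGS = {"service", "team", "env", "owner"}

def missing_tags(resources: list[dict]) -> dict[str, set[str]]:
    """Return, per resource name, the required tag keys it lacks."""
    gaps = {}
    for res in resources:
        present = {t.split(":", 1)[0] for t in res.get("tags", [])}
        absent = REQUIRED_TAGS - present
        if absent:
            gaps[res["name"]] = absent
    return gaps

# Hypothetical dashboard inventory, not real client resources.
dashboards = [
    {"name": "checkout-latency", "tags": ["service:checkout", "team:payments",
                                          "env:prod", "owner:alice"]},
    {"name": "legacy-billing", "tags": ["env:prod"]},
]
print(missing_tags(dashboards))
```

Failing the build (or opening a ticket) whenever this dictionary is non-empty is what keeps tagging from drifting again after the audit ends.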
Step 4: Enablement and rollout
We ran short workshops on log patterns, metrics hygiene, and rehydration. We set up a pilot on a non-critical service to validate that changes did not harm visibility. After the pilot, we expanded to high-impact services and applied the same rules. We integrated ticketing with Jira Service Desk so action items had owners and deadlines.
Obstacles we addressed
- Fear of losing observability: We compared before-and-after dashboards, sampled error paths, and validated alert behavior during the pilot. Teams saw that retention tuning did not remove the signals they rely on.
- Fragmented ownership: We named owners per service and per action, then kept momentum with short daily check-ins.
- High-cardinality data in legacy services: We added filters at the source and updated libraries to curb explosive label growth without breaking parsing or search.
- Downstream dependencies: We coordinated with analytics stakeholders to protect required feeds or replace them with cheaper sources.
Results: Measurable Savings and a Stronger Observability Framework
Our client cut Datadog spend without losing monitoring coverage. The audit produced clear gains that teams and Finance can track month after month.
- Lower run rate: $12,000 per month in actionable savings now, with up to $30,000 per month identified.
- Annual impact: $112,000 per year from log retention tuning alone.
- Stronger unit economics: The rehydration approach remains cost-effective even at a 250× increase in rehydration volume.
- Less noise, faster triage: Fewer non-actionable logs and cleaner alert rules helped engineers focus on real signals.
- Clear ownership: Tagging standards and named owners made spend attribution and reviews straightforward.
- Adoption at scale: 9 of 11 recommendations accepted and implemented or scheduled for Q3.
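The 250× headroom above is a unit-economics comparison: archiving everything and rehydrating on demand stays cheaper than indexed retention until rehydration volume grows enormously. A sketch with hypothetical rates (none of these figures are the client’s actual prices):

```python
def archive_total_cost(gb_archived: float, gb_rehydrated: float,
                       archive_rate: float, rehydrate_rate: float) -> float:
    """Monthly cost of archiving everything and rehydrating on demand."""
    return gb_archived * archive_rate + gb_rehydrated * rehydrate_rate

# Hypothetical monthly figures: 10 TB archived at $0.002/GB,
# vs. keeping it all indexed at $0.10/GB.
indexed_cost = 10_000 * 0.10                                # 1000.0
baseline = archive_total_cost(10_000, 4, 0.002, 0.10)       # typical month
stress = archive_total_cost(10_000, 4 * 250, 0.002, 0.10)   # 250x rehydration spike
print(baseline, stress, indexed_cost)
```

Even in the stress scenario the archive-plus-rehydrate cost stays well below full indexing, which is why the savings survive occasional audit or incident spikes.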
Ready to Reduce Your Datadog Spend?
If your Datadog bill is rising faster than the value you get, schedule a six-week Datadog cost optimization audit. We will quantify savings, protect monitoring quality, and hand over a clear plan your Engineering and FinOps teams can run.
Contact our experts to book a short discovery call.