Key Challenges & Context: Observability Spend Outpaced Value

By Q2 2025, the Datadog bill of our client (a cloud-based communications provider in the SaaS industry) had jumped 30% in one quarter to $360,000. Spend outpaced the value teams got from monitoring. No one owned retention, tagging, or custom metrics. Procurement and FinOps lacked predictability, and Engineering needed clear signals without extra manual work.

They ran into the same issues across services:

  • Kept everything. Teams kept info logs for 15 days when daily work needed far less.
  • Multiplied custom metrics. Teams created duplicate or rarely queried series.
  • Let tagging drift. Inconsistent tags blocked cost attribution and cross-team queries.
  • Worked from different playbooks. Each team ran Datadog its own way, so fixes did not scale.
  • Drowned in alerts. Noisy rules buried real signals and slowed triage.

They tried quick fixes: teams lowered some log levels, trimmed dashboards, and deleted a few metrics. Without shared rules or ownership, changes faded and costs kept rising.

The impact hit the business:

  • Monthly charges fluctuated, which made planning hard for Procurement and FinOps.
  • Engineers spent hours cleaning data instead of shipping features.
  • Blind cuts threatened monitoring coverage during incidents.

Our client needed a simple, shared way to control Datadog costs while keeping the coverage that protects operations.

Approach: A Six-Week Datadog Cost Audit That Kept Signal and Cut Waste

We built a six-week, low-risk audit that brought Engineering, SRE, and FinOps to the same table. Our client chose us for a clear promise: quantify savings, keep coverage, and leave a simple governance model that teams can run.

Strategy at a glance

  • Start with facts, not guesses: measure where costs accrue by data type, service, and environment.
  • Fix the foundations first: tagging standards, ownership, and retention rules.
  • Pilot before scale: prove no loss of coverage on a non-critical path, then expand.
  • Make changes stick: assign owners, create a light review cadence, and document how to maintain gains.

Step 1: Discovery
We mapped Datadog usage across logs, metrics, APM, RUM, and synthetics. We isolated top cost drivers, high-cardinality sources, and “keep everything” retention habits. We documented the dashboards, alerts, and incident workflows that depended on the data, so savings would not break day-to-day operations.

Step 2: Assessment
We broke down spend by team, service, and environment. We quantified “what-if” scenarios for retention, rehydration, and metric cleanup, including the run-rate impact per action. We highlighted where tagging gaps blocked cost attribution and made cross-team queries harder than they should be.
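
A “what-if” scenario of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the model used in the audit: the class and function names are hypothetical, and the event volume and unit prices are placeholder values rather than Datadog list prices or client data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetentionScenario:
    """One what-if scenario for a single log stream (all inputs hypothetical)."""
    name: str
    monthly_events_millions: float  # indexed log events per month, in millions
    price_per_million: float        # assumed unit price at this retention tier


def monthly_cost(scenario: RetentionScenario) -> float:
    """Monthly cost of one scenario: volume times unit price."""
    return scenario.monthly_events_millions * scenario.price_per_million


def run_rate_delta(current: RetentionScenario, proposed: RetentionScenario) -> float:
    """Monthly savings (positive) from moving from `current` to `proposed`."""
    return monthly_cost(current) - monthly_cost(proposed)


# Placeholder inputs: replace with measured volume and contracted prices.
current = RetentionScenario("info logs, 15-day retention", 1000.0, 2.00)
proposed = RetentionScenario("info logs, 7-day retention", 1000.0, 1.50)

savings = run_rate_delta(current, proposed)
print(f"Monthly savings: ${savings:,.2f}")
print(f"Annualized savings: ${savings * 12:,.2f}")
```

Running one such comparison per team, service, and environment gives the run-rate impact per action that the assessment ranks.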

Step 3: Recommendations
We delivered a ranked backlog with savings per item, effort, risk, and owner. The plan focused on:

  • Retention tuning: reduce info log retention from 15 to 7 days where daily work required less, keep error logs longer, and introduce tiers by environment.
  • Rehydration rules: move colder data to cheaper storage and restore only what audits or incidents actually need.
  • Custom metric cleanup: remove unused and duplicate series, and address high-cardinality patterns. Align remaining series to SLIs and SLOs.
  • Tagging standards: require service, team, environment, and ownership tags to improve queries, dashboards, and cost allocation.
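
A tagging standard like the one above is easiest to keep when it can be checked automatically. The sketch below assumes tags arrive as `key:value` strings and that the four required keys are named `service`, `team`, `env`, and `owner`; those exact key names and the function names are illustrative assumptions, not the client's actual convention.

```python
# Assumed key names for the four required tags (hypothetical convention).
REQUIRED_TAGS = ("service", "team", "env", "owner")


def parse_tags(raw_tags: list[str]) -> dict[str, str]:
    """Parse 'key:value' strings into a dict, skipping malformed or empty tags."""
    parsed: dict[str, str] = {}
    for tag in raw_tags:
        key, sep, value = tag.partition(":")
        key, value = key.strip().lower(), value.strip()
        if sep and key and value:
            parsed[key] = value
    return parsed


def missing_tags(raw_tags: list[str]) -> list[str]:
    """Return the required tag keys that are absent, in the standard's order."""
    present = parse_tags(raw_tags)
    return [key for key in REQUIRED_TAGS if key not in present]


# Example: a resource tagged with service and env only.
print(missing_tags(["service:checkout", "env:prod"]))
```

A check like this could run in CI or in a periodic report, so that untagged resources show up before they distort cost attribution.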

Step 4: Enablement and rollout
We ran short workshops on log patterns, metrics hygiene, and rehydration. We set up a pilot on a non-critical service to validate that changes did not harm visibility. After the pilot, we expanded to high-impact services and applied the same rules. We integrated ticketing with Jira Service Desk so action items had owners and deadlines.

Obstacles we addressed

  • Fear of losing observability: We compared before-and-after dashboards, sampled error paths, and validated alert behavior during the pilot. Teams saw that retention tuning did not remove the signals they rely on.
  • Fragmented ownership: We named owners per service and per action, then kept momentum with short daily check-ins.
  • High-cardinality data in legacy services: We added filters at the source and updated libraries to curb explosive label growth without breaking parsing or search.
  • Downstream dependencies: We coordinated with analytics stakeholders to protect required feeds or replace them with cheaper sources.
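
To make the high-cardinality obstacle concrete, the sketch below counts distinct tag combinations per metric name and flags metrics above a threshold. It assumes each unique combination of metric name and tag values counts as one series; the sample data, threshold, and function names are hypothetical and only illustrate the pattern of an unbounded tag such as a user ID.

```python
from collections import defaultdict


def series_per_metric(samples: list[tuple[str, list[str]]]) -> dict[str, int]:
    """Count distinct tag combinations seen for each metric name."""
    combos: dict[str, set[frozenset[str]]] = defaultdict(set)
    for name, tags in samples:
        combos[name].add(frozenset(tags))
    return {name: len(seen) for name, seen in combos.items()}


def flag_high_cardinality(
    samples: list[tuple[str, list[str]]], threshold: int
) -> list[str]:
    """Return metric names whose series count exceeds `threshold`, sorted."""
    counts = series_per_metric(samples)
    return sorted(name for name, count in counts.items() if count > threshold)


# Hypothetical samples: a per-user tag creates one series per user.
samples = [
    ("api.request.count", ["env:prod", "user_id:1"]),
    ("api.request.count", ["env:prod", "user_id:2"]),
    ("api.request.count", ["env:prod", "user_id:3"]),
    ("queue.depth", ["env:prod"]),
    ("queue.depth", ["env:prod"]),
]

print(series_per_metric(samples))
print(flag_high_cardinality(samples, threshold=2))
```

Dropping or bucketing the unbounded tag at the source collapses those series into one, which is the effect the source-side filters aim for.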

Results: Measurable Savings and a Stronger Observability Framework

Our client cut Datadog spend without losing monitoring coverage. The audit produced clear gains that teams and Finance can track month after month.

  • Lower run rate: $12,000 per month in actionable savings now, with up to $30,000 per month identified.
  • Annual impact: $112,000 per year from log retention tuning alone.
  • Stronger unit economics: The rehydration approach stays cost-effective even at a 250× increase in rehydration use.
  • Less noise, faster triage: Fewer non-actionable logs and cleaner alert rules helped engineers focus on real signals.
  • Clear ownership: Tagging standards and named owners made spend attribution and reviews straightforward.
  • Adoption at scale: 9 of 11 recommendations accepted and implemented or scheduled for Q3.

Ready to Reduce Your Datadog Spend?

If your Datadog bill is rising faster than the value you get, schedule a six-week Datadog cost optimization audit. We will quantify savings, protect monitoring quality, and hand over a clear plan your Engineering and FinOps teams can run.

Contact our experts to book a short discovery call.
