Streaming Flakiness Metrics to Datadog

Tagged flakiness metrics flow from CI into Datadog, where an SLO monitor evaluates the breach threshold.

Root cause

Flakiness is invisible until it is measured over time. A single CI run tells you only whether that attempt passed or failed; it cannot tell you that a test passed only after a retry 14% of the time this week. JavaScript test runners surface retries in their reporters, but that data disappears when the job container is torn down. The fix is to emit a numeric metric at the moment of each retry decision, while the suite, branch, and environment context still exists in the runner process, and ship it to a store that aggregates across runs. Datadog is well suited because it lets you slice metrics by tag and define an SLO directly over a metric, so the same flaky counter that you graph also powers the breach monitor. Custom metrics are billed by unique tag combination, so the tags themselves should stay low-cardinality.
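
To make the aggregation concrete, here is a minimal sketch of the calculation that only becomes possible once outcomes from many runs sit in one place. The record shape (`status`, `retry`) mirrors the fields the reporter checks; the `flakyRate` helper and the sample week are illustrative assumptions, not part of any library.

```javascript
// Hypothetical helper: fraction of executions that passed only after a retry.
// Each record is the final outcome of one test execution in one CI run.
function flakyRate(executions) {
  if (executions.length === 0) return 0;
  const flaky = executions.filter(
    (e) => e.status === 'passed' && e.retry > 0
  ).length;
  return flaky / executions.length;
}

// Made-up sample: the same test across seven runs in one week.
const week = [
  { status: 'passed', retry: 0 },
  { status: 'passed', retry: 1 }, // passed on a retry: flaky
  { status: 'passed', retry: 0 },
  { status: 'failed', retry: 2 }, // never passed: broken, not flaky
  { status: 'passed', retry: 0 },
  { status: 'passed', retry: 0 },
  { status: 'passed', retry: 0 },
];

console.log(flakyRate(week).toFixed(3)); // 1 flaky execution out of 7
```

No single record in `week` reveals the rate; it exists only in the aggregate, which is why the data has to outlive the job container.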

The two transports each have a failure mode. dogstatsd is fire-and-forget UDP: it adds no round-trip latency, but a missing or unreachable Datadog Agent silently drops packets, so you never learn that a metric vanished. The metrics API is an authenticated HTTPS POST that returns a status code you can assert on, but it adds round-trip latency to every push and is rate-limited. On ephemeral CI runners with no Agent sidecar, the metrics API is usually the correct default.
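
One way to encode that default is a small selection function evaluated at the start of the job. This is a sketch under assumptions: it treats the presence of `DD_AGENT_HOST` as the signal that an Agent is available and `DD_API_KEY` as the signal that the API can be used. The variable names follow Datadog's conventions, but using them as a switch, and the `chooseTransport` helper itself, are hypothetical.

```javascript
// Hypothetical selection logic: prefer dogstatsd only when an Agent host is
// explicitly configured, otherwise fall back to the metrics API.
function chooseTransport(env) {
  if (env.DD_AGENT_HOST) {
    return {
      kind: 'dogstatsd',
      host: env.DD_AGENT_HOST,
      port: Number(env.DD_DOGSTATSD_PORT || 8125), // 8125 is the default DogStatsD port
    };
  }
  if (env.DD_API_KEY) {
    return { kind: 'api' };
  }
  // Neither transport is configured: report that instead of dropping data silently.
  return { kind: 'none' };
}

console.log(chooseTransport({ DD_API_KEY: 'redacted' }).kind); // api
```

Returning an explicit `none` lets the pipeline log or fail loudly rather than repeat the silent-drop failure mode in configuration form.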

Step-by-step fix

1. Emit a tagged counter from the test reporter

Hook into your runner’s reporter to increment a counter whenever a test passes only after a retry. Tag with the dimensions you will slice by later.

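// reporters/datadog-flaky-reporter.js (Playwright custom reporter)
// Trade-off: one HTTPS POST per test adds CI time and consumes API rate limit,
// so buffer points in memory and flush once at the end of the run.
// Tags stay at suite/branch/env level to keep cardinality low.
import { client, v2 } from '@datadog/datadog-api-client';

export default class DatadogFlakyReporter {
  points = [];

  // result.retry is the attempt index: 0 for the first run, 1+ for retries.
  onTestEnd(test, result) {
    // Flaky means the test passed, but only on a retry attempt.
    if (result.status === 'passed' && result.retry > 0) {
      this.points.push({
        metric: 'ci.tests.flaky',
        type: 1, // count
        points: [{ timestamp: Math.floor(Date.now() / 1000), value: 1 }],
        tags: [
          `suite:${test.parent.title}`,
          `branch:${process.env.GIT_BRANCH || 'unknown'}`,
          `env:${process.env.TEST_ENV || 'ci'}`,
        ],
      });
    }
  }

  async onEnd() {
    if (this.points.length === 0) return; // no flaky tests, nothing to send
    // createConfiguration() picks up DD_API_KEY from the environment.
    const configuration = client.createConfiguration();
    const api = new v2.MetricsApi(configuration);
    try {
      await api.submitMetrics({ body: { series: this.points } });
    } catch (err) {
      // Make a metrics outage visible without failing the test run itself.
      console.error('Datadog metric submission failed:', err);
    }
  }
}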
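
The flakiness ratio used for the SLO also needs a denominator, `ci.tests.total`. A minimal sketch of how a reporter could tally both counters per suite before flushing is shown here. It assumes `outcomes` holds one record per test taken from its final attempt; the `buildSeries` helper and that record shape are illustrative, not part of Playwright or the Datadog client.

```javascript
// Hypothetical aggregation: one count per suite for each metric, so payload
// size depends on the number of suites rather than the number of tests.
function buildSeries(outcomes, baseTags, timestamp) {
  const bySuite = new Map();
  for (const { suite, status, retry } of outcomes) {
    const tally = bySuite.get(suite) || { total: 0, flaky: 0 };
    tally.total += 1;
    if (status === 'passed' && retry > 0) tally.flaky += 1;
    bySuite.set(suite, tally);
  }

  const series = [];
  for (const [suite, tally] of bySuite) {
    const tags = [`suite:${suite}`, ...baseTags];
    const metrics = [
      ['ci.tests.total', tally.total],
      ['ci.tests.flaky', tally.flaky], // sent even when 0 so the ratio stays defined
    ];
    for (const [metric, value] of metrics) {
      series.push({ metric, type: 1, points: [{ timestamp, value }], tags });
    }
  }
  return series;
}

const outcomes = [
  { suite: 'checkout', status: 'passed', retry: 0 },
  { suite: 'checkout', status: 'passed', retry: 1 },
  { suite: 'checkout', status: 'failed', retry: 2 },
  { suite: 'search', status: 'passed', retry: 0 },
];
const series = buildSeries(outcomes, ['branch:main', 'env:ci'], 1700000000);
console.log(series.length); // two metrics for each of the two suites
```

The returned array has the same shape as the `series` field passed to `submitMetrics`, so it could replace the per-test buffer if you prefer pre-aggregated counts.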

2. Prefer dogstatsd when an Agent is available

If your runner already has the Datadog Agent (self-hosted runners, Kubernetes CI), dogstatsd avoids per-run HTTPS overhead entirely.

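// Trade-off: UDP adds no round trip, but packets are silently dropped if the
// Agent is down. Only use this when you control the runner and can
// health-check the Agent.
import { StatsD } from 'hot-shots';

const dogstatsd = new StatsD({
  host: 'localhost',
  port: 8125,
  prefix: 'ci.', // 'tests.flaky' is sent as 'ci.tests.flaky'
  // Reports local socket errors only; a dropped UDP packet raises nothing.
  errorHandler: (err) => console.error('dogstatsd error:', err),
});

export function recordFlaky({ suite, branch, env }) {
  dogstatsd.increment('tests.flaky', 1, [`suite:${suite}`, `branch:${branch}`, `env:${env}`]);
}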
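
For debugging a silent drop it helps to know what actually goes over the wire. The sketch below formats a DogStatsD count datagram by hand; the client library does this for you, so the `formatCountDatagram` helper is purely illustrative and the tag values are made up.

```javascript
// Illustrative formatter for a DogStatsD count datagram:
//   <metric>:<value>|c|#<tag1>,<tag2>,...
function formatCountDatagram(metric, value, tags) {
  const tagPart = tags.length > 0 ? `|#${tags.join(',')}` : '';
  return `${metric}:${value}|c${tagPart}`;
}

console.log(
  formatCountDatagram('ci.tests.flaky', 1, ['suite:checkout', 'branch:main', 'env:ci'])
);
// ci.tests.flaky:1|c|#suite:checkout,branch:main,env:ci
```

Because the payload is a single plain-text UDP packet, you can watch for it on port 8125 with a packet capture to confirm whether metrics are leaving the runner at all.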

3. Submit from a GitHub Actions step

When you cannot embed a reporter, post a parsed count from the workflow, reading the API key from the DATADOG_API_KEY secret.

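# .github/workflows/test.yml
- name: Stream flakiness to Datadog
  if: always()
  env:
    DD_API_KEY: ${{ secrets.DATADOG_API_KEY }}
    BRANCH: ${{ github.ref_name }}
  run: |
    # Count tests (not specs) that took more than one attempt and finally passed.
    # `..` also walks suites nested inside describe blocks.
    FLAKY=$(jq '[.. | objects | select(has("tests")) | .tests[] | select((.results | length) > 1 and .results[-1].status == "passed")] | length' results.json)
    # Build the payload with jq so the branch name is JSON-escaped.
    PAYLOAD=$(jq -n --arg branch "$BRANCH" --argjson flaky "$FLAKY" --argjson ts "$(date +%s)" \
      '{series: [{metric: "ci.tests.flaky", type: 1, points: [{timestamp: $ts, value: $flaky}], tags: ["branch:\($branch)", "env:ci"]}]}')
    # Trade-off: one POST per run keeps API rate-limit headroom vs per-test.
    # --fail turns an HTTP error into a non-zero exit so a rejected submission is visible.
    # api.datadoghq.com is the US1 site; change the host for other Datadog sites.
    curl -sS --fail -X POST "https://api.datadoghq.com/api/v2/series" \
      -H "DD-API-KEY: ${DD_API_KEY}" \
      -H "Content-Type: application/json" \
      -d "$PAYLOAD"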
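
If you would rather do the parsing in Node than in jq, the same count can be sketched as a small function. It assumes `results.json` has the nested shape the jq filter implies (suites containing specs, tests, and results, with suites optionally nested inside suites); the `countFlaky` helper and the sample report are illustrative.

```javascript
// Hypothetical parser: counts tests that needed more than one attempt and
// whose final attempt passed. Tests that never passed are not counted.
function countFlaky(report) {
  let flaky = 0;
  const visit = (suite) => {
    for (const spec of suite.specs || []) {
      for (const test of spec.tests || []) {
        const results = test.results || [];
        const last = results[results.length - 1];
        if (results.length > 1 && last.status === 'passed') flaky += 1;
      }
    }
    // Suites can nest (for example, describe blocks inside a file).
    for (const child of suite.suites || []) visit(child);
  };
  for (const suite of report.suites || []) visit(suite);
  return flaky;
}

// Made-up report: one clean pass, one flaky pass, one hard failure.
const sampleReport = {
  suites: [
    {
      specs: [{ tests: [{ results: [{ status: 'passed' }] }] }],
      suites: [
        {
          specs: [
            { tests: [{ results: [{ status: 'failed' }, { status: 'passed' }] }] },
            { tests: [{ results: [{ status: 'failed' }, { status: 'failed' }] }] },
          ],
        },
      ],
    },
  ],
};

console.log(countFlaky(sampleReport)); // 1
```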

4. Define the SLO and monitor

Create a metric monitor in Datadog over the flakiness ratio, and define an SLO of, for example, 98% of executions passing without a retry, evaluated on a rolling 7-day window. The ratio needs a denominator, so emit ci.tests.total (one count per executed test) with the same tags as ci.tests.flaky. Set the monitor to alert when the ratio of ci.tests.flaky to ci.tests.total crosses the breach threshold, grouped by suite. Grouping by suite means one noisy suite cannot hide inside an otherwise healthy aggregate, the same principle used when building a QA reliability dashboard in Grafana. For long-term trend analysis beyond the SLO window, pair this with historical flakiness tracking analytics.
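
The arithmetic behind the breach check is simple enough to sketch. The example below computes the pass-without-retry ratio per suite and compares it with the 98% target; the `evaluateSlo` helper and the weekly counts are made-up illustrations, not Datadog output.

```javascript
// Hypothetical evaluation: SLI = (total - flaky) / total for each suite.
function evaluateSlo(countsBySuite, target) {
  const report = {};
  for (const [suite, { flaky, total }] of Object.entries(countsBySuite)) {
    const sli = total === 0 ? 1 : (total - flaky) / total;
    report[suite] = { sli, breached: sli < target };
  }
  return report;
}

// Made-up 7-day counts.
const counts = {
  checkout: { flaky: 45, total: 1500 }, // 97.0% pass without retry
  search: { flaky: 12, total: 2400 }, // 99.5% pass without retry
};

const report = evaluateSlo(counts, 0.98);
console.log(report.checkout.breached, report.search.breached); // true false
```

Summed across both suites the same data gives 57 flaky out of 3900 executions, about 98.5% passing without a retry, which would satisfy the target. Only the per-suite grouping exposes the checkout breach.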

Pitfalls

Unbounded tag cardinality. Tagging by test_id or run_id explodes custom metric counts and your bill. Mitigation: tag by suite and branch only; keep run-level detail in logs.
Silent dogstatsd drops. A missing Agent loses every metric with no error. Mitigation: health-check the Agent in a CI pre-step, or fall back to the metrics API.
Counting failures as flaky. A test that never passes is broken, not flaky. Mitigation: only increment when final status is passed and retry > 0.
Branch tag noise from forks. Thousands of short-lived branches fragment the series. Mitigation: normalize feature branches to a branch:feature bucket, keep main distinct.
Submitting metrics only on success. Using if: success() hides the runs that mattered. Mitigation: use if: always() so failing pipelines still report.
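
The branch bucketing mitigation above can be sketched as a small normalizer applied before the tag is built. The bucket names, the `release/` prefix rule, and the `normalizeBranch` helper are assumptions for illustration; adapt them to your own branching model.

```javascript
// Hypothetical normalizer: collapse short-lived branches into a fixed set of
// buckets so the branch tag stays low-cardinality.
function normalizeBranch(ref) {
  if (!ref) return 'unknown';
  if (ref === 'main' || ref === 'master') return ref; // keep the default branch distinct
  if (ref.startsWith('release/')) return 'release'; // assumed release naming scheme
  return 'feature'; // everything else shares one bucket
}

console.log(`branch:${normalizeBranch('fix/login-timeout')}`); // branch:feature
```

With this in place the branch dimension contributes at most four tag values, no matter how many branches or forks exist.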

Reliability targets

Metric	Target	Notes
Flaky execution rate	< 2% over 7d	`ci.tests.flaky / ci.tests.total`
SLO (pass without retry)	≥ 98%	Rolling 7-day window, grouped by suite
Metric submission latency	< 500 ms p95	Metrics API POST round trip
Monitor evaluation delay	< 5 min	Datadog metric monitor default
`main` branch pass rate	≥ 99%	Stricter gate than feature branches

Frequently Asked Questions

Q: Should I use dogstatsd or the metrics API from CI? A: Use the metrics API on ephemeral cloud runners with no Datadog Agent — it returns a status code you can assert on. Use dogstatsd only on self-hosted runners where you control and can health-check the Agent.

Q: How do I keep custom metric costs under control? A: Limit tags to low-cardinality dimensions (suite, branch bucket, env). Avoid per-test or per-run-id tags and route that granularity to logs instead. Datadog counts a custom metric for each unique combination of metric name and tag values, so a small tag set keeps the billed count bounded.

Q: Can a single counter power both the dashboard and the alert? A: Yes. Define the SLO over the same ci.tests.flaky metric you graph, with ci.tests.total as the denominator, then alert on that SLO. One emission path keeps the alert and the chart consistent.
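
On the cost question above, a rough upper bound on the number of series a metric can produce is the product of the distinct values of each tag. The sketch below uses made-up counts; real usage is lower because only combinations that are actually submitted count, and the `maxTagCombinations` helper is illustrative.

```javascript
// Hypothetical estimator: worst-case tag combinations for one metric name.
function maxTagCombinations(distinctValuesPerTag) {
  return Object.values(distinctValuesPerTag).reduce(
    (product, count) => product * count,
    1
  );
}

// Made-up example: 40 suites, 3 branch buckets, 2 environments.
const bounded = maxTagCombinations({ suite: 40, branch: 3, env: 2 });
// Same metric with a per-test tag added (5000 distinct tests).
const exploded = maxTagCombinations({ suite: 40, branch: 3, env: 2, test_id: 5000 });

console.log(bounded, exploded); // 240 1200000
```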