Root cause #
Flakiness is invisible until it is measured over time. A single CI run only tells you pass or fail for that attempt; it cannot tell you that a test passed on its second retry 14% of the time this week. JavaScript test runners surface retries in their reporters, but that data evaporates when the job container is torn down. The fix is to emit a numeric metric at the moment of each retry decision, while the suite, branch, and environment context still exists in the runner process, and ship it to a store that aggregates across runs. Datadog is well suited because it accepts arbitrary tags on custom metrics (at a cost per tag combination, so keep them low-cardinality) and lets you define an SLO directly over a metric, so the same flaky counter that you graph also powers the breach monitor.
Each of the two transports has its own failure mode. dogstatsd is fire-and-forget UDP: it adds no round trip and negligible latency, but a missing or unreachable Datadog Agent silently drops packets, so you never know a metric vanished. The metrics API is an authenticated HTTPS POST that returns a status code you can assert on, but it adds round-trip latency to every push and is rate-limited. On ephemeral CI runners with no Agent sidecar, the metrics API is usually the correct default.
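One way to encode that default is a small selector that picks dogstatsd only when an Agent address is configured. This is a minimal sketch under stated assumptions: `chooseTransport` is a hypothetical helper, and it assumes the Agent address and port arrive through the `DD_AGENT_HOST` and `DD_DOGSTATSD_PORT` environment variables and the API key through `DD_API_KEY`.

```javascript
// Hypothetical helper: pick a metric transport from the runner's environment.
// Assumption: an Agent is considered available only when DD_AGENT_HOST is set.
function chooseTransport(env) {
  if (env.DD_AGENT_HOST) {
    return {
      kind: 'dogstatsd',
      host: env.DD_AGENT_HOST,
      port: Number(env.DD_DOGSTATSD_PORT || 8125),
    };
  }
  if (env.DD_API_KEY) {
    return { kind: 'api' }; // authenticated HTTPS POST, status code available
  }
  return { kind: 'none' }; // nothing configured: the caller should log and skip
}
```

Passing `process.env` as the argument keeps the function free of global state, so the selection logic can be unit tested without touching the real environment.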
Step-by-step fix #
1. Emit a tagged counter from the test reporter #
Hook into your runner’s reporter to increment a counter whenever a test passes only after a retry. Tag with the dimensions you will slice by later.
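// reporters/datadog-flaky-reporter.js: Playwright custom reporter
// Trade-off: count in memory and flush once in onEnd, so the run costs one
// HTTPS POST instead of one per test, while each series keeps its suite tag.
import { client, v2 } from '@datadog/datadog-api-client';

// suite title -> number of tests that passed only after a retry
const flakyBySuite = new Map();

export default class DatadogFlakyReporter {
  onTestEnd(test, result) {
    // Flaky means this attempt passed, but only after at least one retry.
    if (result.status === 'passed' && result.retry > 0) {
      const suite = test.parent.title;
      flakyBySuite.set(suite, (flakyBySuite.get(suite) || 0) + 1);
    }
  }

  async onEnd() {
    if (flakyBySuite.size === 0) return; // nothing flaky this run, skip the POST
    const timestamp = Math.floor(Date.now() / 1000);
    // One series per suite, summed locally: the API stores submitted values
    // as-is, so duplicate points with identical tags are not added together.
    const series = [...flakyBySuite].map(([suite, value]) => ({
      metric: 'ci.tests.flaky',
      type: 1, // count
      points: [{ timestamp, value }],
      tags: [
        `suite:${suite}`,
        `branch:${process.env.GIT_BRANCH || 'unknown'}`,
        `env:${process.env.TEST_ENV || 'ci'}`,
      ],
    }));
    // createConfiguration() reads DD_API_KEY from the environment by default.
    const api = new v2.MetricsApi(client.createConfiguration());
    await api.submitMetrics({ body: { series } });
  }
}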
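The SLO step later divides by ci.tests.total, so the reporter needs a denominator as well as the flaky count. The sketch below shows one way to tally both per suite and build a single payload. `tallyBySuite` and `buildSeries` are hypothetical helpers, and the attempt record shape (`suite`, `status`, `retry`) is an assumption that mirrors the fields read in onTestEnd. It counts a test once, on its first attempt, and ignores skipped tests.

```javascript
// Hypothetical helper: tally per-suite counts from attempt records.
// Each record is { suite, status, retry }, one per finished attempt.
function tallyBySuite(attempts) {
  const tally = new Map();
  for (const { suite, status, retry } of attempts) {
    if (status === 'skipped') continue; // skipped tests never executed
    const entry = tally.get(suite) || { total: 0, flaky: 0 };
    if (retry === 0) entry.total += 1; // first attempt: exactly one per test
    if (status === 'passed' && retry > 0) entry.flaky += 1; // passed after retry
    tally.set(suite, entry);
  }
  return tally;
}

// Hypothetical helper: turn the tally into count series for one submission.
function buildSeries(tally, baseTags, timestamp) {
  const series = [];
  for (const [suite, { total, flaky }] of tally) {
    const tags = [`suite:${suite}`, ...baseTags];
    series.push({ metric: 'ci.tests.total', type: 1, points: [{ timestamp, value: total }], tags });
    series.push({ metric: 'ci.tests.flaky', type: 1, points: [{ timestamp, value: flaky }], tags });
  }
  return series;
}
```

Emitting a zero for ci.tests.flaky on clean suites is deliberate in this sketch: it keeps both series present for every suite, which makes the ratio easier to query.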
2. Prefer dogstatsd when an Agent is available #
If your runner already has the Datadog Agent (self-hosted runners, Kubernetes CI), dogstatsd avoids per-run HTTPS overhead entirely.
// Trade-off: UDP is zero-latency but silently drops if the Agent is down;
// only use when you control the runner and can health-check the Agent.
import { StatsD } from 'hot-shots';

const dogstatsd = new StatsD({ host: 'localhost', port: 8125, prefix: 'ci.' });

export function recordFlaky({ suite, branch, env }) {
  dogstatsd.increment('tests.flaky', 1, [`suite:${suite}`, `branch:${branch}`, `env:${env}`]);
}
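To see why nothing reports a drop, it helps to look at what the client puts on the wire. The sketch below formats a counter in the DogStatsD datagram layout (`name:value|c|#tag1,tag2`). `formatCount` is a hypothetical helper for illustration only; in practice hot-shots does this formatting for you.

```javascript
// Hypothetical helper: format one counter increment as a DogStatsD datagram.
function formatCount(metric, value, tags = []) {
  const tagPart = tags.length > 0 ? `|#${tags.join(',')}` : '';
  return `${metric}:${value}|c${tagPart}`;
}

const datagram = formatCount('ci.tests.flaky', 1, ['suite:checkout', 'branch:main', 'env:ci']);
// datagram === 'ci.tests.flaky:1|c|#suite:checkout,branch:main,env:ci'
```

That string is the whole exchange: one UDP packet with no acknowledgement, so the sender gets no signal when the Agent is absent.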
3. Submit from a GitHub Actions step #
When you cannot embed a reporter, post a parsed count from the workflow, reading the API key from the DATADOG_API_KEY secret.
# .github/workflows/test.yml
- name: Stream flakiness to Datadog
  if: always()
  env:
    DD_API_KEY: ${{ secrets.DATADOG_API_KEY }}
    BRANCH: ${{ github.ref_name }}
  run: |
    # Count tests that ran more than once AND passed on the last attempt,
    # at any suite nesting depth. Tests that never pass are broken, not flaky.
    FLAKY=$(jq '[.. | objects | select(has("results")) | select((.results | length) > 1 and .results[-1].status == "passed")] | length' results.json)
    # Trade-off: one POST per run keeps API rate-limit headroom vs per-test.
    # --fail turns a non-2xx response into a failed step instead of a silent loss.
    curl -sS --fail -X POST "https://api.datadoghq.com/api/v2/series" \
      -H "DD-API-KEY: ${DD_API_KEY}" \
      -H "Content-Type: application/json" \
      -d "{\"series\":[{\"metric\":\"ci.tests.flaky\",\"type\":1,\"points\":[{\"timestamp\":$(date +%s),\"value\":${FLAKY}}],\"tags\":[\"branch:${BRANCH}\",\"env:ci\"]}]}"
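The count can also be computed in Node, which is easier to unit test than a jq filter. This is a minimal sketch, assuming results.json has the nested shape the filter implies (suites containing specs, tests, and results, with suites optionally nested inside suites) and that each result carries a status string. `countFlaky` is a hypothetical name. It counts a test only when more than one attempt ran and the last one passed.

```javascript
// Hypothetical helper: count tests that passed only after a retry.
// Assumed report shape: { suites: [{ specs: [{ tests: [{ results: [{ status }] }] }], suites: [...] }] }
function countFlaky(report) {
  let flaky = 0;
  const visit = (suite) => {
    for (const spec of suite.specs || []) {
      for (const test of spec.tests || []) {
        const results = test.results || [];
        const last = results[results.length - 1];
        // More than one attempt and a passing final attempt means flaky.
        if (results.length > 1 && last.status === 'passed') flaky += 1;
      }
    }
    for (const child of suite.suites || []) visit(child); // nested describe blocks
  };
  for (const suite of report.suites || []) visit(suite);
  return flaky;
}
```

A test whose every attempt failed has several results too, which is why the final status check matters: without it, broken tests inflate the flaky count.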
4. Define the SLO and monitor #
Create a metric monitor in Datadog over the flakiness ratio. Define an SLO of, for example, 98% of executions passing without a retry, evaluated on a rolling 7-day window. The ratio needs a denominator, so emit a ci.tests.total counter (one increment per test execution, with the same tags) alongside ci.tests.flaky. Set the monitor to alert when the ratio of ci.tests.flaky to ci.tests.total crosses the breach threshold, grouped by suite. Grouping by suite means one noisy suite does not drown out an otherwise healthy signal, which is the same principle used when building a QA reliability dashboard in Grafana. For long-term trend analysis beyond the SLO window, pair this with historical flakiness tracking analytics.
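As a worked example of the arithmetic behind that threshold (the numbers are illustrative, not measured): with a 98% target, a suite that ran 5,000 test executions in the window can absorb 5,000 × 0.02 = 100 retried passes before it breaches. A minimal sketch of the same calculation, using a hypothetical `sloStatus` helper:

```javascript
// Hypothetical helper: evaluate a pass-without-retry SLO for one suite.
function sloStatus({ flaky, total, target = 0.98 }) {
  if (total === 0) {
    return { passRate: 1, budget: 0, remaining: 0, breached: false };
  }
  const passRate = (total - flaky) / total;
  // Whole retried executions the target allows; the epsilon absorbs float error.
  const budget = Math.floor(total * (1 - target) + 1e-9);
  return { passRate, budget, remaining: budget - flaky, breached: passRate < target };
}

const healthy = sloStatus({ flaky: 60, total: 5000 });
const noisy = sloStatus({ flaky: 150, total: 5000 });
```

Here `healthy` still has budget left, while `noisy` has overspent it and is in breach. Datadog performs this evaluation itself once the SLO is defined; the sketch is only meant to make the threshold concrete.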
Pitfalls #
- Unbounded tag cardinality. Tagging by test_id or run_id explodes custom metric counts and your bill. Mitigation: tag by suite and branch only; keep run-level detail in logs.
- Silent dogstatsd drops. A missing Agent loses every metric with no error. Mitigation: health-check the Agent in a CI pre-step, or fall back to the metrics API.
- Counting failures as flaky. A test that never passes is broken, not flaky. Mitigation: only increment when final status is passed and retry > 0.
- Branch tag noise from forks. Thousands of short-lived branches fragment the series. Mitigation: normalize feature branches to a branch:feature bucket, keep main distinct.
- Submitting metrics only on success. Using if: success() hides the runs that mattered. Mitigation: use if: always() so failing pipelines still report.
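The branch-noise mitigation can be sketched as a small normalizer applied before tags are built. `normalizeBranch` and `flakyTags` are hypothetical helpers, and the list of branches kept distinct is an assumption you would adapt to your repository.

```javascript
// Hypothetical helper: collapse short-lived branches into one tag value.
function normalizeBranch(branch, keep = ['main']) {
  if (!branch) return 'unknown'; // env var missing in this runner
  return keep.includes(branch) ? branch : 'feature';
}

// Hypothetical helper: build the low-cardinality tag set for one suite.
function flakyTags({ suite, branch, env }) {
  return [
    `suite:${suite}`,
    `branch:${normalizeBranch(branch)}`,
    `env:${env || 'ci'}`,
  ];
}
```

With this in place, `fix/login-timeout` and `feat/new-cart` both report as branch:feature, so the number of branch values stays fixed no matter how many pull requests are open.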
Reliability targets #
| Metric | Target | Notes |
|---|---|---|
| Flaky execution rate | < 2% over 7d | ci.tests.flaky / ci.tests.total |
| SLO (pass without retry) | ≥ 98% | Rolling 7-day window, grouped by suite |
| Metric submission latency | < 500 ms p95 | Metrics API POST round trip |
| Monitor evaluation delay | < 5 min | Datadog metric monitor default |
| main branch pass rate | ≥ 99% | Stricter gate than feature branches |
Frequently Asked Questions #
Q: Should I use dogstatsd or the metrics API from CI?
A: Use the metrics API on ephemeral cloud runners with no Datadog Agent, because it returns a status code you can assert on. Use dogstatsd only on self-hosted runners where you control and can health-check the Agent.
Q: How do I keep custom metric costs under control?
A: Limit tags to low-cardinality dimensions (suite, branch bucket, env). Avoid per-test or per-run-id tags and route that granularity to logs instead. Datadog counts custom metrics by distinct tag combination, so a small tag set keeps the billed count small.
Q: Can a single counter power both the dashboard and the alert?
A: Yes. Define the SLO over the same ci.tests.flaky metric you graph, then attach a metric monitor to the SLO. One emission path keeps the alert and the chart consistent.
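To put rough numbers on the cost question above: an upper bound on the series a metric can create is the product of each tag's distinct value count. The sketch below uses a hypothetical `maxTagCombinations` helper and illustrative counts. The real figure is usually lower, since only combinations that actually report are counted.

```javascript
// Hypothetical helper: upper bound on distinct tag combinations for one metric.
function maxTagCombinations(tagValueCounts) {
  return Object.values(tagValueCounts).reduce((product, n) => product * n, 1);
}

// Illustrative counts: 40 suites, 2 branch buckets, 2 environments.
const lean = maxTagCombinations({ suite: 40, branch: 2, env: 2 });
// Same tags plus a per-test tag with 3,000 distinct values.
const bloated = maxTagCombinations({ suite: 40, branch: 2, env: 2, test_id: 3000 });
```

In this example `lean` is 160 and `bloated` is 480,000, which is why per-test identifiers belong in logs rather than in metric tags.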