Root cause
Flakiness is statistical, so any metric computed from a single build is noise. A test that fails 1 run in 50 looks identical to a hard failure if you only see the failing build. The data you need usually already exists: most CI runners can emit a JUnit XML or JSON report listing each test’s outcome and, when retries are enabled, whether a retry recovered it. The problem is that these reports are scattered across thousands of ephemeral artifacts that disappear when the runner is recycled.
The job is therefore aggregation, not detection. You persist each run’s per-test outcomes into a durable store keyed by a stable test identifier (for example, file or class name plus the full test title), then compute flake rate over a rolling window of recent runs. The standard definition: flake rate is the number of runs in which a test produced a flaky outcome (failed at least once, then passed on a retry) divided by the total number of runs of that test in the window. A rolling window keeps the number responsive: a test fixed last week should not be penalized by failures from a month ago.
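To make the definition concrete, here is a minimal sketch of classifying one run from its attempts and then computing the rate over a window. The helper names `classifyRun` and `flakeRatePct`, and the `'pass'`/`'fail'` attempt labels, are assumptions for illustration rather than any runner's API.

```javascript
// classify-run.js: hypothetical helpers, not part of any test runner's API.
// `attempts` is the ordered list of attempt results for one test in one run,
// e.g. ['fail', 'pass'] for a test that failed once and passed on retry.
function classifyRun(attempts) {
  if (attempts.length === 0) throw new Error('no attempts recorded');
  const last = attempts[attempts.length - 1];
  if (last !== 'pass') return 'failed'; // never recovered: a hard failure
  return attempts.includes('fail') ? 'flaky' : 'passed';
}

// Flake rate (percent) = flaky runs / total runs in the window.
// Returns null for an empty window, because "no data" is not the same as 0%.
function flakeRatePct(statuses) {
  if (statuses.length === 0) return null;
  const flaky = statuses.filter((s) => s === 'flaky').length;
  return +(100 * flaky / statuses.length).toFixed(2);
}

// 50 runs: 48 clean passes, 1 hard failure, 1 recovered on retry.
const runs = [
  ...Array(48).fill(['pass']),
  ['fail', 'fail', 'fail'],
  ['fail', 'pass'],
];
const statuses = runs.map(classifyRun);
console.log(flakeRatePct(statuses)); // 2
```

The hard failure counts in the denominator but not the numerator, so the window above reports 2% rather than 4%.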
Step-by-step fix
1. Normalize each run’s report into rows
Parse the JUnit XML or runner JSON into one record per test per run, with a stable id and a status of passed, failed, or flaky.
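```javascript
// ingest.js: normalize one run's JUnit XML report into rows for the store
const fs = require('fs');
const { XMLParser } = require('fast-xml-parser');

// fast-xml-parser yields a single object for one child and an array for many.
const asArray = (x) => (x == null ? [] : [].concat(x));
// Check presence, not truthiness: an empty <failure/> can parse to '' (falsy).
const has = (c, tag) => c?.[tag] !== undefined;

function toRows(xmlPath, runId, ts) {
  const xml = fs.readFileSync(xmlPath, 'utf8');
  const doc = new XMLParser({ ignoreAttributes: false }).parse(xml);
  // Accept either a <testsuites> wrapper or a bare <testsuite> root.
  const suites = asArray(doc.testsuites?.testsuite ?? doc.testsuite);
  const cases = suites.flatMap((s) => asArray(s.testcase));
  return cases.map((c) => ({
    run_id: runId,
    ts,
    test_id: `${c['@_classname']}::${c['@_name']}`,
    // 'flaky' = recovered on retry. Surefire-style reports mark this with a
    // <flakyFailure> or <flakyError> node; other runners may use other markers.
    status: has(c, 'flakyFailure') || has(c, 'flakyError') ? 'flaky'
      : has(c, 'failure') || has(c, 'error') ? 'failed'
      : 'passed',
  }));
}

module.exports = { toRows };
// Trade-off: keying on classname+name is stable across runs but breaks
// if you rename a test; treat a rename as a new id and accept a reset.
```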
2. Persist rows to a durable store
Append rows to a table that outlives the runner. One row per test per run keeps the window query simple.
```sql
-- schema.sql
CREATE TABLE IF NOT EXISTS test_runs (
  run_id  TEXT      NOT NULL,
  ts      TIMESTAMP NOT NULL,
  test_id TEXT      NOT NULL,
  status  TEXT      NOT NULL CHECK (status IN ('passed', 'failed', 'flaky'))
);

-- Index the window query path: most-recent rows per test.
CREATE INDEX IF NOT EXISTS idx_runs_test_ts ON test_runs (test_id, ts DESC);

-- Trade-off: storing every row is simple and auditable but grows fast;
-- prune rows older than the longest window you query (CI storage cost).
```
3. Compute flake rate over a rolling window
Aggregate the last N runs per test. Flake rate is flaky outcomes over total runs in the window.
```sql
-- flake_rate.sql: rolling 50-run window per test
WITH windowed AS (
  SELECT test_id, status,
         ROW_NUMBER() OVER (PARTITION BY test_id ORDER BY ts DESC) AS rn
  FROM test_runs
)
SELECT test_id,
       COUNT(*) AS runs,
       SUM(CASE WHEN status = 'flaky' THEN 1 ELSE 0 END) AS flaky_runs,
       ROUND(100.0 * SUM(CASE WHEN status = 'flaky' THEN 1 ELSE 0 END) / COUNT(*), 2) AS flake_rate_pct
FROM windowed
WHERE rn <= 50            -- rolling window of the last 50 runs
GROUP BY test_id
HAVING COUNT(*) >= 10     -- ignore tests with too little history to be meaningful
ORDER BY flake_rate_pct DESC;

-- Trade-off: a 50-run window reacts fast but is jumpy for rarely-run
-- tests; widen it for low-frequency suites to avoid noisy percentages.
```
4. Or aggregate in Node when there is no database
For small teams a JSON ledger and an in-memory reduce is enough. Same formula, no SQL.
```javascript
// flake-rate.js: compute per-test flake rate from a JSON ledger
const fs = require('fs');
const rows = require('./test_runs.json'); // [{test_id, ts, status}, ...]

const WINDOW = 50;   // rolling window: newest N runs per test
const MIN_RUNS = 10; // skip tests with too little history to be meaningful

// Sort a copy newest-first. new Date() accepts epoch milliseconds or ISO
// strings; subtracting raw ISO strings would yield NaN and break the sort.
const newestFirst = [...rows].sort((a, b) => new Date(b.ts) - new Date(a.ts));

const byTest = new Map();
for (const r of newestFirst) {
  const bucket = byTest.get(r.test_id) ?? [];
  if (bucket.length < WINDOW) bucket.push(r.status); // keep newest N
  byTest.set(r.test_id, bucket);
}

const report = [...byTest]
  .map(([test_id, s]) => ({
    test_id,
    runs: s.length,
    flake_rate_pct: +(100 * s.filter((x) => x === 'flaky').length / s.length).toFixed(2),
  }))
  .filter((r) => r.runs >= MIN_RUNS);

fs.writeFileSync('flake-rate.json', JSON.stringify(report, null, 2));
// Trade-off: in-memory is fine for thousands of rows but will not scale
// to full history; move to the SQL store once the ledger gets large.
```
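Until a move to SQL is justified, pruning keeps the ledger bounded. The sketch below keeps only the newest `keep` rows per test; `pruneLedger` is a hypothetical helper, and the sample rows and their ISO timestamps are made up for illustration.

```javascript
// prune-ledger.js: keep only the newest `keep` rows per test (hypothetical helper).
function pruneLedger(rows, keep) {
  const kept = new Map(); // test_id -> number of rows kept so far
  // Sort a copy newest-first so the input array is not mutated.
  const newestFirst = [...rows].sort((a, b) => new Date(b.ts) - new Date(a.ts));
  return newestFirst.filter((r) => {
    const n = kept.get(r.test_id) ?? 0;
    kept.set(r.test_id, n + 1);
    return n < keep; // drop anything older than the newest `keep` rows
  });
}

// Illustrative ledger: test "a" has three rows, test "b" has one.
const ledger = [
  { test_id: 'a', ts: '2024-01-01T00:00:00Z', status: 'flaky' },
  { test_id: 'a', ts: '2024-01-02T00:00:00Z', status: 'passed' },
  { test_id: 'a', ts: '2024-01-03T00:00:00Z', status: 'passed' },
  { test_id: 'b', ts: '2024-01-01T00:00:00Z', status: 'passed' },
];
const pruned = pruneLedger(ledger, 2);
console.log(pruned.length); // 3
```

Set `keep` to the longest window you query so that pruning never changes a reported rate.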
Pitfalls
- Counting one build as the rate: a single failure is not a percentage. Mitigation: require a minimum window (e.g. 10 runs) before reporting.
- Unstable test ids: renames split history. Mitigation: key on a slug you control and treat renames as deliberate resets.
- Mixing failed and flaky: a hard failure is not flakiness. Mitigation: only count `flaky` (failed-then-recovered) outcomes in the numerator.
- Infinite history: old failures haunt fixed tests. Mitigation: use a rolling window and prune rows past the longest window.
- Sharded double-counting: one logical run emits several reports. Mitigation: dedupe by `run_id` before aggregating.
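The dedupe step for sharded runs can be sketched as follows. It assumes each row carries the `run_id` and `test_id` fields from the ingest step, and that when the same test is reported twice for one run, the first report wins; `dedupeRows` and the sample shard data are hypothetical.

```javascript
// dedupe.js: collapse duplicate rows so each test counts once per logical run.
function dedupeRows(rows) {
  const byKey = new Map();
  for (const r of rows) {
    // JSON-encode the pair so ids containing separators cannot collide.
    const key = JSON.stringify([r.run_id, r.test_id]);
    if (!byKey.has(key)) byKey.set(key, r); // assumption: first report wins
  }
  return [...byKey.values()];
}

// Two shards of the same logical run, one test reported by both.
const shardA = [
  { run_id: 'run-1', ts: 1, test_id: 'login::works', status: 'flaky' },
];
const shardB = [
  { run_id: 'run-1', ts: 1, test_id: 'login::works', status: 'flaky' }, // duplicate
  { run_id: 'run-1', ts: 1, test_id: 'cart::adds item', status: 'passed' },
];
const merged = dedupeRows([...shardA, ...shardB]);
console.log(merged.length); // 2
```

If shards can disagree about a test's status, pick an explicit precedence rule instead of first-wins; which rule is right depends on how your runner shards and retries.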
Reliability targets
| Metric | Target |
|---|---|
| Rolling window size | 50 runs (≥ 10 minimum) |
| Per-test flake rate | < 1% |
| Suite-wide flake rate | < 0.5% |
| Quarantine threshold | ≥ 5% flake rate |
| History retention | ≥ longest window |
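One way to act on these targets is to label each entry of the per-test report. The sketch below takes its thresholds from the table; the label names (`insufficient-data`, `quarantine`, `watch`, `healthy`) and the `labelTest` helper are assumptions for illustration.

```javascript
// thresholds.js: label report entries against the targets in the table above.
const TARGET_PCT = 1;     // per-test target: below 1%
const QUARANTINE_PCT = 5; // quarantine at or above 5%
const MIN_RUNS = 10;      // minimum window before a rate is reported

// `entry` has the shape produced by the rate computation: {test_id, runs, flake_rate_pct}.
function labelTest(entry) {
  if (entry.runs < MIN_RUNS) return 'insufficient-data';
  if (entry.flake_rate_pct >= QUARANTINE_PCT) return 'quarantine';
  if (entry.flake_rate_pct >= TARGET_PCT) return 'watch';
  return 'healthy';
}

// Illustrative report entries (made-up numbers).
const sampleReport = [
  { test_id: 'checkout::pays', runs: 50, flake_rate_pct: 6 },
  { test_id: 'login::works', runs: 50, flake_rate_pct: 2 },
  { test_id: 'cart::adds item', runs: 50, flake_rate_pct: 0 },
  { test_id: 'search::new test', runs: 4, flake_rate_pct: 25 },
];
for (const e of sampleReport) console.log(e.test_id, labelTest(e));
```

Checking the run count first keeps a brand-new test with one unlucky run from being quarantined on a percentage that is not yet meaningful.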
Frequently Asked Questions
Q: What exactly counts as a flaky outcome? A: A run in which the test failed at least one attempt and then passed on a retry. A clean pass and an all-attempts failure are both excluded from the numerator; only recovery-on-retry is flakiness.
Q: Why use a rolling window instead of all history? A: A rolling window makes the rate responsive. A test fixed last week should trend toward zero quickly, which it cannot do if flaky outcomes from months ago stay in the calculation forever.
Q: How do I visualize the result? A: Feed the per-test percentages to a dashboard. See “Tracking test flakiness trends over time” for trend analysis and the “Reliability dashboards for QA teams” guide for charting.