Article · Flaky Test Detection & Quarantine Engineering

Calculating Flake Rate From CI History

A single run tells you a test failed; only the history tells you how often it flakes, so flake rate is the metric that turns scattered red builds into a number you can act on. This guide is part of Historical Flakiness Tracking & Analytics and shows how to aggregate JUnit and JSON results over time, apply the flake-rate formula across a rolling window, and compute it with SQL or Node.

[Figure: Flake rate from a rolling window. Per-run results (JUnit / JSON) stream into a results store (one row per test), a rolling window selects the last N runs, and the flake-rate formula produces a per-test percentage. Flake rate = flaky outcomes / total runs, where a flaky outcome is failed-then-passed on retry within the window.]
Flake rate is the share of runs in a window where a test recovered on retry rather than passing or failing cleanly.

Root cause

Flakiness is statistical, so any metric computed from one build is noise. A test that fails 1 run in 50 looks identical to a hard failure if you only see the failing build. The data you need already exists — every CI run emits a JUnit XML or JSON report listing each test’s outcome and whether a retry recovered it — but it is scattered across thousands of ephemeral artifacts that disappear when the runner is recycled.

The job is therefore aggregation, not detection. You persist each run’s per-test outcomes into a durable store keyed by a stable test identifier (file plus full title, or classname plus name when the source is a JUnit report), then compute flake rate over a rolling window of recent runs. The standard definition: flake rate is the number of runs in which a test produced a flaky outcome (failed at least once, then passed on a retry) divided by the total number of runs of that test in the window. A rolling window keeps the number responsive: a test fixed last week should not be penalized by failures from a month ago.
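As a worked illustration of the definition above, the following sketch computes the rate for one test from a newest-first list of statuses. The helper name `flakeRate` and the sample history are hypothetical, chosen only to show how the window changes the answer for a test that was recently fixed.

```javascript
// Hypothetical helper: flake rate (%) over the newest `windowSize` runs.
// `statuses` is ordered newest first; each entry is 'passed' | 'failed' | 'flaky'.
function flakeRate(statuses, windowSize = statuses.length) {
  const recent = statuses.slice(0, windowSize);
  if (recent.length === 0) return null; // no history, no rate
  const flaky = recent.filter((s) => s === 'flaky').length;
  return (100 * flaky) / recent.length;
}

// Made-up history: 50 clean recent runs after a fix, preceded by 50 older
// runs in which the test recovered on retry 10 times.
const recentRuns = Array(50).fill('passed');
const olderRuns = [...Array(10).fill('flaky'), ...Array(40).fill('passed')];
const history = [...recentRuns, ...olderRuns];

console.log(flakeRate(history, 50)); // rolling 50-run window: 0
console.log(flakeRate(history));     // all history: 10
```

With the 50-run window the fixed test already reports 0%, while the all-history figure still shows 10% because the old flaky runs remain in the calculation.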

Step-by-step fix

1. Normalize each run’s report into rows

Parse the JUnit XML or runner JSON into one record per test per run, with a stable id and a status of passed, failed, or flaky.

// ingest.js — normalize one run into rows for the store
const { XMLParser } = require('fast-xml-parser');
const fs = require('fs');

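function toRows(xmlPath, runId, ts) {
  const doc = new XMLParser({ ignoreAttributes: false }).parse(fs.readFileSync(xmlPath, 'utf8'));
  // Reports may have a <testsuites> root or a single bare <testsuite> root.
  const suites = [].concat(doc.testsuites?.testsuite ?? doc.testsuite ?? []);
  const cases = suites.flatMap((s) => [].concat(s.testcase ?? []));
  // An empty element such as <failure/> can parse to '', which is falsy,
  // so test for presence of the key rather than truthiness.
  const has = (c, ...tags) => tags.some((t) => c[t] !== undefined);
  return cases
    .filter((c) => !has(c, 'skipped')) // a skipped test did not run
    .map((c) => ({
      run_id: runId,
      ts,
      test_id: `${c['@_classname']}::${c['@_name']}`,
      // 'flaky' = recovered on retry. Surefire-style reports record this as a
      // <flakyFailure> or <flakyError> child; check what your runner emits.
      status: has(c, 'flakyFailure', 'flakyError') ? 'flaky'
        : has(c, 'failure', 'error') ? 'failed' : 'passed',
    }));
}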
module.exports = { toRows };
// Trade-off: keying on classname+name is stable across runs but breaks
// if you rename a test; treat a rename as a new id and accept a reset.
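
Runner JSON reports often list every attempt of a test rather than a single verdict. The following is a minimal sketch of collapsing attempts into the three statuses used here. It assumes a hypothetical report shape in which each test carries `file`, `title`, and an `attempts` array of 'passed' or 'failed' strings in execution order; real runners name these fields differently, so treat `classifyAttempts` and `jsonToRows` as illustrative names only.

```javascript
// Hypothetical input shape: { file, title, attempts: ['failed', 'passed'] }.
// Field names are assumptions for illustration, not any runner's real schema.
function classifyAttempts(attempts) {
  const anyFailed = attempts.includes('failed');
  const lastPassed = attempts[attempts.length - 1] === 'passed';
  if (lastPassed && anyFailed) return 'flaky'; // failed, then recovered on retry
  if (lastPassed) return 'passed';             // clean pass
  return 'failed';                             // never recovered
}

function jsonToRows(tests, runId, ts) {
  return tests
    .filter((t) => t.attempts.length > 0) // tests with no attempts did not run
    .map((t) => ({
      run_id: runId,
      ts,
      test_id: `${t.file}::${t.title}`,
      status: classifyAttempts(t.attempts),
    }));
}
```

The rows it returns have the same shape as the output of `toRows`, so both sources can feed the same store.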

2. Persist rows to a durable store

Append rows to a table that outlives the runner. One row per test per run keeps the window query simple.

-- schema.sql
CREATE TABLE IF NOT EXISTS test_runs (
  run_id   TEXT NOT NULL,
  ts       TIMESTAMP NOT NULL,
  test_id  TEXT NOT NULL,
  status   TEXT NOT NULL CHECK (status IN ('passed','failed','flaky'))
);
-- Index the window query path: most-recent rows per test.
CREATE INDEX IF NOT EXISTS idx_runs_test_ts ON test_runs (test_id, ts DESC);
-- Trade-off: storing every row is simple and auditable but grows fast;
-- prune rows older than the longest window you query (CI storage cost).

3. Compute flake rate over a rolling window

Aggregate the last N runs per test. Flake rate is flaky outcomes over total runs in the window.

-- flake_rate.sql — rolling 50-run window per test
WITH windowed AS (
  SELECT test_id, status,
         ROW_NUMBER() OVER (PARTITION BY test_id ORDER BY ts DESC) AS rn
  FROM test_runs
)
SELECT test_id,
       COUNT(*) AS runs,
       SUM(CASE WHEN status = 'flaky' THEN 1 ELSE 0 END) AS flaky_runs,
       ROUND(100.0 * SUM(CASE WHEN status = 'flaky' THEN 1 ELSE 0 END) / COUNT(*), 2) AS flake_rate_pct
FROM windowed
WHERE rn <= 50            -- rolling window of the last 50 runs
GROUP BY test_id
HAVING COUNT(*) >= 10     -- ignore tests with too little history to be meaningful
ORDER BY flake_rate_pct DESC;
-- Trade-off: a 50-run window reacts fast but is jumpy for rarely-run
-- tests; widen it for low-frequency suites to avoid noisy percentages.

4. Or aggregate in Node when there is no database

For small teams a JSON ledger and an in-memory reduce is enough. Same formula, no SQL.

// flake-rate.js — compute per-test flake rate from a JSON ledger
const rows = require('./test_runs.json'); // [{test_id, ts, status}, ...]
const WINDOW = 50;
const byTest = new Map();
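// ts may be an ISO string or epoch millis; Date handles both, whereas
// subtracting strings yields NaN. Sort a copy, newest first.
for (const r of [...rows].sort((a, b) => new Date(b.ts) - new Date(a.ts))) {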
  const bucket = byTest.get(r.test_id) ?? [];
  if (bucket.length < WINDOW) bucket.push(r.status); // keep newest N
  byTest.set(r.test_id, bucket);
}
const report = [...byTest].map(([test_id, s]) => ({
  test_id,
  runs: s.length,
  flake_rate_pct: +(100 * s.filter((x) => x === 'flaky').length / s.length).toFixed(2),
})).filter((r) => r.runs >= 10);
require('fs').writeFileSync('flake-rate.json', JSON.stringify(report, null, 2));
// Trade-off: in-memory is fine for thousands of rows but will not scale
// to full history; move to the SQL store once the ledger gets large.

Pitfalls

  • Counting one build as the rate: a single failure is not a percentage. Mitigation: require a minimum window (e.g. 10 runs) before reporting.
  • Unstable test ids: renames split history. Mitigation: key on a slug you control and treat renames as deliberate resets.
  • Mixing failed and flaky: a hard failure is not flakiness. Mitigation: only count flaky (failed-then-recovered) outcomes in the numerator.
  • Infinite history: old failures haunt fixed tests. Mitigation: use a rolling window and prune rows past the longest window.
  • Sharded double-counting: one logical run emits several reports. Mitigation: give every shard the same run_id and dedupe on (run_id, test_id) before aggregating.
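
The sharding pitfall can be handled before rows reach the store. A minimal sketch, assuming every shard of a logical run reports the same `run_id`; the helper name `dedupeRows` is hypothetical, and keeping the first row seen is an arbitrary choice rather than a recommended policy.

```javascript
// Keep one row per (run_id, test_id). Assumes all shards of a logical run
// share one run_id; the first row seen for a pair wins (arbitrary choice).
function dedupeRows(rows) {
  const seen = new Set();
  return rows.filter((r) => {
    const key = `${r.run_id}\u0000${r.test_id}`; // NUL separator avoids key collisions
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

If two shards can disagree about the same test, decide explicitly which status should win instead of relying on arrival order.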

Reliability targets

Metric                   Target
Rolling window size      50 runs (≥ 10 minimum)
Per-test flake rate      < 1%
Suite-wide flake rate    < 0.5%
Quarantine threshold     ≥ 5% flake rate
History retention        ≥ longest window
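
One way to act on these targets is to tag each entry of the per-test report (objects with `test_id`, `runs`, and `flake_rate_pct`) with a suggested action. This is only a sketch: the two thresholds come from the table, but the function name `triage` and the tier labels, including the middle 'investigate' tier for tests between 1% and 5%, are assumptions added for illustration.

```javascript
// Thresholds taken from the targets table; tier names are hypothetical.
const TARGET_PCT = 1;     // per-test flake rate should stay below this
const QUARANTINE_PCT = 5; // at or above this, quarantine the test

function triage(report) {
  return report.map((r) => ({
    ...r,
    action:
      r.flake_rate_pct >= QUARANTINE_PCT ? 'quarantine'
      : r.flake_rate_pct >= TARGET_PCT ? 'investigate'
      : 'ok',
  }));
}
```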

Frequently Asked Questions

Q: What exactly counts as a flaky outcome? A: A run in which the test failed at least one attempt and then passed on a retry. A clean pass and an all-attempts failure are both excluded from the numerator; only recovery-on-retry is flakiness.

Q: Why use a rolling window instead of all history? A: A rolling window makes the rate responsive. A test fixed last week should trend toward zero quickly, which it cannot do if flaky runs from months ago stay in the calculation forever.

Q: How do I visualize the result? A: Feed the per-test percentages to a dashboard. See tracking test flakiness trends over time for trends and the reliability dashboards for QA teams guide for charting.