Surfacing Flakiness With Playwright

Playwright assigns a `flaky` status automatically; the JSON reporter is where you read it back out.

Root cause #

When retries is greater than zero, Playwright re-runs a failed test in a fresh worker context. If a later attempt passes, the test’s outcome is recorded as flaky rather than passed or failed. That status is the product of genuine non-determinism — typically an auto-waiting timeout that occasionally fires, a network response that arrives outside the expected window, or shared backend state colliding across parallel workers.

The reason teams miss this signal is that a run with only flaky tests still exits zero, so CI is green and nobody looks closer. Playwright is doing the hard part of classification for you; the gap is purely in surfacing. Because each retry exposes testInfo.retry, you can also attach per-attempt diagnostics and then mine the JSON report for the complete flaky set after the run.

Step-by-step fix #

1. Enable retries in config #

Set retries so Playwright can classify outcomes. Keep the count low on CI and zero locally so developers see failures immediately.

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Retries only on CI; local runs stay strict so flakiness is visible.
  retries: process.env.CI ? 2 : 0,
  reporter: [['list'], ['json', { outputFile: 'results.json' }]],
});
// Trade-off: 2 retries keeps CI unblocked but can mask a 50%-failing
// test; rely on the JSON parse below to catch recurrence, not the exit code.

2. Tag the attempt with testInfo.retry #

Inside tests or a fixture, use testInfo.retry to record that an attempt was a retry. This lets you collect richer artifacts only when it matters.

// fixtures.ts
import { test as base } from '@playwright/test';

export const test = base.extend({
  page: async ({ page }, use, testInfo) => {
    // testInfo.retry is 0 on the first attempt, >0 on retries.
    if (testInfo.retry > 0) {
      console.warn(`Retry ${testInfo.retry} for "${testInfo.title}"`);
    }
    await use(page);
  },
});
// Trade-off: capturing traces on every retry aids debugging but adds
// CI storage cost; gate trace capture on testInfo.retry > 0.

3. Parse the JSON report for the flaky set #

The JSON reporter writes per-spec results. A test whose status is flaky is exactly what you want; you do not need to recompute it from attempts.

// extract-flaky.js
const fs = require('fs');
const report = JSON.parse(fs.readFileSync('results.json', 'utf8'));
const flaky = [];

function walk(suite, path = []) {
  for (const s of suite.suites ?? []) walk(s, [...path, s.title]);
  for (const spec of suite.specs ?? []) {
    for (const t of spec.tests) {
      // Playwright sets status 'flaky' when a retry passed after a failure.
      if (t.status === 'flaky') {
        flaky.push({ file: spec.file, title: [...path, spec.title].join(' > ') });
      }
    }
  }
}
report.suites.forEach((s) => walk(s));
fs.writeFileSync('flaky-playwright.json', JSON.stringify(flaky, null, 2));
console.warn(`Surfaced ${flaky.length} flaky test(s).`);
// Trade-off: reading the report is free, but merge per-shard JSON files
// before counting so sharded runs do not undercount the flaky set.

4. Feed the flaky set forward #

Use the extracted list to open issues, gate a budget, or hand it to quarantine. The same JSON shape works as the input to a quarantine workflow.

// budget.js
const flaky = require('./flaky-playwright.json');
const BUDGET = 5;
if (flaky.length > BUDGET) {
  console.error(`Flaky budget exceeded: ${flaky.length} > ${BUDGET}`);
  process.exit(1); // block merge until the regression is triaged
}
// Trade-off: a strict budget catches drift early but can stall PRs;
// start with the current baseline and ratchet down weekly.

Pitfalls #

Reading the exit code instead of the report: a flaky-only run exits zero. Mitigation: always parse results.json and act on status === 'flaky'.
Recomputing flakiness from attempts: Playwright already labels it. Mitigation: trust the flaky status; only walk attempts for diagnostics.
Undercounting across shards: each shard emits a partial report. Mitigation: use blob reporter and merge-reports, or merge JSON before counting.
Retries hiding hard failures over time: a steadily worsening test stays “flaky”. Mitigation: track the flake rate trend, not just the per-run count.
Capturing traces on every attempt: storage and time balloon. Mitigation: gate artifacts on testInfo.retry > 0.

Reliability targets #

Metric	Target
Retries on CI	1-2
Retries locally	0
Flaky tests per run	≤ 5 (budgeted)
Flaky rate per suite	< 1.5%
CI pass rate (post-retry)	≥ 99%

Frequently Asked Questions #

Q: What is the difference between failed and flaky in Playwright? A: failed means every attempt failed; flaky means at least one attempt failed but a later one passed. Only flaky indicates non-determinism worth recording.

Q: Do retries change the worker or context? A: Yes. Each retry runs in a fresh worker and a new browser context, so leaked in-memory state from a prior attempt does not carry over — which is why ordering and backend-state issues still cause flakiness.

Q: How does this relate to the Jest approach? A: The mechanism is the same idea applied to a different runner. See detecting flaky tests with Jest retryTimes for the unit-test equivalent and a custom reporter.