Article · Flaky Test Detection & Quarantine Engineering

Surfacing Flakiness With Playwright Test Retries

Playwright already tracks retries for you and labels a test flaky when it fails then passes within the same run, so the work is not adding retries but extracting that signal from the report. This guide is part of Automated Flaky Test Detection Tools and walks through reading testInfo.retry, parsing the JSON reporter, and pulling out the exact set of tests Playwright already knows are unstable.

10 sections URL: /flaky-test-detection-quarantine-engineering/automated-flaky-test-detection-tools/surfacing-flakiness-with-playwright-retries/
Playwright retry status flow A test result moves through attempts and is assigned one of three outcomes: passed, flaky, or failed, with the JSON reporter capturing each. retry 0 first attempt fail retry 1 testInfo.retry=1 status: flaky passed on retry status: failed all attempts fail JSON reporter
Playwright assigns a `flaky` status automatically; the JSON reporter is where you read it back out.

Root cause #

When retries is greater than zero, Playwright re-runs a failed test in a fresh worker context. If a later attempt passes, the test’s outcome is recorded as flaky rather than passed or failed. That status is the product of genuine non-determinism — typically an auto-waiting timeout that occasionally fires, a network response that arrives outside the expected window, or shared backend state colliding across parallel workers.

The reason teams miss this signal is that a run with only flaky tests still exits zero, so CI is green and nobody looks closer. Playwright is doing the hard part of classification for you; the gap is purely in surfacing. Because each retry exposes testInfo.retry, you can also attach per-attempt diagnostics and then mine the JSON report for the complete flaky set after the run.

Step-by-step fix #

1. Enable retries in config #

Set retries so Playwright can classify outcomes. Keep the count low on CI and zero locally so developers see failures immediately.

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Retries only on CI; local runs stay strict so flakiness is visible.
  retries: process.env.CI ? 2 : 0,
  reporter: [['list'], ['json', { outputFile: 'results.json' }]],
});
// Trade-off: 2 retries keeps CI unblocked but can mask a 50%-failing
// test; rely on the JSON parse below to catch recurrence, not the exit code.

2. Tag the attempt with testInfo.retry #

Inside tests or a fixture, use testInfo.retry to record that an attempt was a retry. This lets you collect richer artifacts only when it matters.

// fixtures.ts
import { test as base } from '@playwright/test';

export const test = base.extend({
  page: async ({ page }, use, testInfo) => {
    // testInfo.retry is 0 on the first attempt, >0 on retries.
    if (testInfo.retry > 0) {
      console.warn(`Retry ${testInfo.retry} for "${testInfo.title}"`);
    }
    await use(page);
  },
});
// Trade-off: capturing traces on every retry aids debugging but adds
// CI storage cost; gate trace capture on testInfo.retry > 0.

3. Parse the JSON report for the flaky set #

The JSON reporter writes per-spec results. A test whose status is flaky is exactly what you want; you do not need to recompute it from attempts.

// extract-flaky.js
const fs = require('fs');
const report = JSON.parse(fs.readFileSync('results.json', 'utf8'));
const flaky = [];

function walk(suite, path = []) {
  for (const s of suite.suites ?? []) walk(s, [...path, s.title]);
  for (const spec of suite.specs ?? []) {
    for (const t of spec.tests) {
      // Playwright sets status 'flaky' when a retry passed after a failure.
      if (t.status === 'flaky') {
        flaky.push({ file: spec.file, title: [...path, spec.title].join(' > ') });
      }
    }
  }
}
report.suites.forEach((s) => walk(s));
fs.writeFileSync('flaky-playwright.json', JSON.stringify(flaky, null, 2));
console.warn(`Surfaced ${flaky.length} flaky test(s).`);
// Trade-off: reading the report is free, but merge per-shard JSON files
// before counting so sharded runs do not undercount the flaky set.

4. Feed the flaky set forward #

Use the extracted list to open issues, gate a budget, or hand it to quarantine. The same JSON shape works as the input to a quarantine workflow.

// budget.js
const flaky = require('./flaky-playwright.json');
const BUDGET = 5;
if (flaky.length > BUDGET) {
  console.error(`Flaky budget exceeded: ${flaky.length} > ${BUDGET}`);
  process.exit(1); // block merge until the regression is triaged
}
// Trade-off: a strict budget catches drift early but can stall PRs;
// start with the current baseline and ratchet down weekly.

Pitfalls #

  • Reading the exit code instead of the report: a flaky-only run exits zero. Mitigation: always parse results.json and act on status === 'flaky'.
  • Recomputing flakiness from attempts: Playwright already labels it. Mitigation: trust the flaky status; only walk attempts for diagnostics.
  • Undercounting across shards: each shard emits a partial report. Mitigation: use blob reporter and merge-reports, or merge JSON before counting.
  • Retries hiding hard failures over time: a steadily worsening test stays “flaky”. Mitigation: track the flake rate trend, not just the per-run count.
  • Capturing traces on every attempt: storage and time balloon. Mitigation: gate artifacts on testInfo.retry > 0.

Reliability targets #

Metric Target
Retries on CI 1-2
Retries locally 0
Flaky tests per run ≤ 5 (budgeted)
Flaky rate per suite < 1.5%
CI pass rate (post-retry) ≥ 99%

Frequently Asked Questions #

Q: What is the difference between failed and flaky in Playwright? A: failed means every attempt failed; flaky means at least one attempt failed but a later one passed. Only flaky indicates non-determinism worth recording.

Q: Do retries change the worker or context? A: Yes. Each retry runs in a fresh worker and a new browser context, so leaked in-memory state from a prior attempt does not carry over — which is why ordering and backend-state issues still cause flakiness.

Q: How does this relate to the Jest approach? A: The mechanism is the same idea applied to a different runner. See detecting flaky tests with Jest retryTimes for the unit-test equivalent and a custom reporter.