Root cause #
When retries is greater than zero, Playwright re-runs a failed test in a fresh worker context. If a later attempt passes, the test’s outcome is recorded as flaky rather than passed or failed. That status is the product of genuine non-determinism — typically an auto-waiting timeout that occasionally fires, a network response that arrives outside the expected window, or shared backend state colliding across parallel workers.
The reason teams miss this signal is that a run with only flaky tests still exits zero, so CI is green and nobody looks closer. Playwright is doing the hard part of classification for you; the gap is purely in surfacing. Because each retry exposes testInfo.retry, you can also attach per-attempt diagnostics and then mine the JSON report for the complete flaky set after the run.
Step-by-step fix #
1. Enable retries in config #
Set retries so Playwright can classify outcomes. Keep the count low on CI and zero locally so developers see failures immediately.
// playwright.config.ts
import { defineConfig } from '@playwright/test';
export default defineConfig({
// Retries only on CI; local runs stay strict so flakiness is visible.
retries: process.env.CI ? 2 : 0,
reporter: [['list'], ['json', { outputFile: 'results.json' }]],
});
// Trade-off: 2 retries keeps CI unblocked but can mask a 50%-failing
// test; rely on the JSON parse below to catch recurrence, not the exit code.
2. Tag the attempt with testInfo.retry #
Inside tests or a fixture, use testInfo.retry to record that an attempt was a retry. This lets you collect richer artifacts only when it matters.
// fixtures.ts
import { test as base } from '@playwright/test';
export const test = base.extend({
page: async ({ page }, use, testInfo) => {
// testInfo.retry is 0 on the first attempt, >0 on retries.
if (testInfo.retry > 0) {
console.warn(`Retry ${testInfo.retry} for "${testInfo.title}"`);
}
await use(page);
},
});
// Trade-off: capturing traces on every retry aids debugging but adds
// CI storage cost; gate trace capture on testInfo.retry > 0.
3. Parse the JSON report for the flaky set #
The JSON reporter writes per-spec results. A test whose status is flaky is exactly what you want; you do not need to recompute it from attempts.
// extract-flaky.js
const fs = require('fs');
const report = JSON.parse(fs.readFileSync('results.json', 'utf8'));
const flaky = [];
function walk(suite, path = []) {
for (const s of suite.suites ?? []) walk(s, [...path, s.title]);
for (const spec of suite.specs ?? []) {
for (const t of spec.tests) {
// Playwright sets status 'flaky' when a retry passed after a failure.
if (t.status === 'flaky') {
flaky.push({ file: spec.file, title: [...path, spec.title].join(' > ') });
}
}
}
}
report.suites.forEach((s) => walk(s));
fs.writeFileSync('flaky-playwright.json', JSON.stringify(flaky, null, 2));
console.warn(`Surfaced ${flaky.length} flaky test(s).`);
// Trade-off: reading the report is free, but merge per-shard JSON files
// before counting so sharded runs do not undercount the flaky set.
4. Feed the flaky set forward #
Use the extracted list to open issues, gate a budget, or hand it to quarantine. The same JSON shape works as the input to a quarantine workflow.
// budget.js
const flaky = require('./flaky-playwright.json');
const BUDGET = 5;
if (flaky.length > BUDGET) {
console.error(`Flaky budget exceeded: ${flaky.length} > ${BUDGET}`);
process.exit(1); // block merge until the regression is triaged
}
// Trade-off: a strict budget catches drift early but can stall PRs;
// start with the current baseline and ratchet down weekly.
Pitfalls #
- Reading the exit code instead of the report: a flaky-only run exits zero. Mitigation: always parse
results.jsonand act onstatus === 'flaky'. - Recomputing flakiness from attempts: Playwright already labels it. Mitigation: trust the
flakystatus; only walk attempts for diagnostics. - Undercounting across shards: each shard emits a partial report. Mitigation: use
blobreporter andmerge-reports, or merge JSON before counting. - Retries hiding hard failures over time: a steadily worsening test stays “flaky”. Mitigation: track the flake rate trend, not just the per-run count.
- Capturing traces on every attempt: storage and time balloon. Mitigation: gate artifacts on
testInfo.retry > 0.
Reliability targets #
| Metric | Target |
|---|---|
| Retries on CI | 1-2 |
| Retries locally | 0 |
| Flaky tests per run | ≤ 5 (budgeted) |
| Flaky rate per suite | < 1.5% |
| CI pass rate (post-retry) | ≥ 99% |
Frequently Asked Questions #
Q: What is the difference between failed and flaky in Playwright?
A: failed means every attempt failed; flaky means at least one attempt failed but a later one passed. Only flaky indicates non-determinism worth recording.
Q: Do retries change the worker or context? A: Yes. Each retry runs in a fresh worker and a new browser context, so leaked in-memory state from a prior attempt does not carry over — which is why ordering and backend-state issues still cause flakiness.
Q: How does this relate to the Jest approach? A: The mechanism is the same idea applied to a different runner. See detecting flaky tests with Jest retryTimes for the unit-test equivalent and a custom reporter.