The core difference is where the “flaky” judgment lives. Playwright bakes a first-class flaky outcome into its JSON reporter when a test fails then passes on retry, so detection is mostly a matter of reading status fields. Cypress has no built-in flaky verdict — you infer it from attempt counts in the mocha-style results, or you lean on Cypress Cloud’s analytics. That single distinction ripples through every layer below.
Prerequisites
| Component | Cypress | Playwright |
|---|---|---|
| Package | cypress@13+ | @playwright/[email protected]+ |
| Node | 18 LTS or 20 LTS | 18 LTS or 20 LTS |
| CI runner | Linux container, 2 vCPU min | Linux container, 2 vCPU min |
| Reporter for detection | cypress JSON + mochawesome | built-in json / blob |
| Artifact for triage | video + screenshots on fail | trace: 'on-first-retry' |
| Parallelism source | Cypress Cloud (--record --parallel) | --shard + workers |
Install the reporters you will parse before wiring detection. For Cypress, pair the native JSON output with mochawesome for attempt-level detail; for Playwright, the built-in json reporter already carries status, retry, and results[].
Step-by-step implementation
1. Enable retries so flakiness becomes observable
Retries are what convert a hard failure into a measurable flake signal. In Cypress, configure split run/open retry counts so local debugging stays strict.
```typescript
// cypress.config.ts
import { defineConfig } from 'cypress';

export default defineConfig({
  retries: {
    runMode: 2, // CI: a pass after a fail flags a flake (cost: up to 3x on failure)
    openMode: 0, // local: never hide nondeterminism while debugging
  },
});
```
Playwright exposes the same idea but emits an explicit verdict.
```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 2 : 0, // CI-only retries; status becomes "flaky" on retry-pass
  reporter: [['json', { outputFile: 'results.json' }], ['list']],
});
```
Trade-off: Retries shrink red builds but also mask genuine regressions if you treat retry-passes as green. Always record the retry count as a separate metric instead of discarding it. The same retry-tooling discipline drives automated flaky test detection tools.
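To make "record the retry count as a separate metric" concrete, here is a minimal sketch that tallies retries from the specs of a Playwright JSON report. The interfaces type only the fields read here and are an assumed slice of the report, and `retryMetrics` is a hypothetical helper name, not part of Playwright.

```typescript
// Assumed slice of the Playwright JSON report: one results[] entry per attempt.
interface PwResult { retry: number; status: string }
interface PwTest { status: string; results: PwResult[] }
interface PwSpec { title: string; tests: PwTest[] }

interface RetryMetrics {
  tests: number;         // tests seen
  retried: number;       // tests that needed more than one attempt
  flaky: number;         // tests Playwright marked "flaky"
  extraAttempts: number; // attempts beyond the first, summed over all tests
}

// Hypothetical helper: keep retries visible instead of folding retry-passes into green.
function retryMetrics(specs: PwSpec[]): RetryMetrics {
  const m: RetryMetrics = { tests: 0, retried: 0, flaky: 0, extraAttempts: 0 };
  for (const spec of specs) {
    for (const t of spec.tests) {
      m.tests += 1;
      const extra = Math.max(0, t.results.length - 1);
      m.extraAttempts += extra;
      if (extra > 0) m.retried += 1;
      if (t.status === 'flaky') m.flaky += 1;
    }
  }
  return m;
}

// Illustrative data: one clean pass, one fail-then-pass.
const sampleSpecs: PwSpec[] = [
  { title: 'adds to cart', tests: [{ status: 'expected', results: [{ retry: 0, status: 'passed' }] }] },
  {
    title: 'checks out',
    tests: [{
      status: 'flaky',
      results: [{ retry: 0, status: 'failed' }, { retry: 1, status: 'passed' }],
    }],
  },
];
const metrics = retryMetrics(sampleSpecs);
```

Emitting `extraAttempts` next to the pass/fail totals keeps the CI cost of retries visible even while the build stays green.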
2. Emit a machine-readable report
The detector consumes the report, not the console. Cypress JSON gives you attempts[] per test, which is your flake proxy.
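```typescript
// scripts/detect-cypress-flakes.ts
// Assumes results/mocha.json exposes tests[] with `state` and `attempts[]`;
// importing JSON requires "resolveJsonModule" in tsconfig.json.
import results from '../results/mocha.json';

const flaky = results.tests.filter(
  (t: any) => t.attempts && t.attempts.length > 1 && t.state === 'passed'
); // passed-after-retry == flaky in Cypress (no native flag)
console.log(`flaky tests: ${flaky.length}`);
```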
Playwright hands you the verdict directly.
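```typescript
// scripts/detect-pw-flakes.ts
import report from '../results.json';

// describe() blocks show up as nested suites, so walk the tree recursively
const collectSpecs = (suite: any): any[] => [
  ...(suite.specs ?? []),
  ...(suite.suites ?? []).flatMap(collectSpecs),
];
const flaky = report.suites
  .flatMap(collectSpecs)
  .filter((spec: any) => spec.tests.some((t: any) => t.status === 'flaky'));
console.log(`flaky tests: ${flaky.length}`);
```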
Trade-off: Cypress’s inferred signal can misclassify a test as flaky when its first failure was deterministic and the retry passed only because state was reseeded; Playwright’s flag is precise but only fires when retries are enabled. Choose retry counts deliberately.
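One way to tighten the inferred Cypress signal is to require that an earlier attempt actually failed, rather than only checking the attempt count. The sketch below assumes each test record carries `state` and an `attempts[]` array of per-attempt states, as described above; the type names and `classify` are hypothetical.

```typescript
// Assumed record shape: a final state plus one entry per attempt.
type AttemptState = 'passed' | 'failed' | 'pending' | 'skipped';
interface CyAttempt { state: AttemptState }
interface CyTest { title: string[]; state: AttemptState; attempts: CyAttempt[] }

type Verdict = 'passed' | 'flaky' | 'failed' | 'skipped';

// Hypothetical classifier: "flaky" means the test finally passed after a failed attempt.
function classify(test: CyTest): Verdict {
  if (test.state === 'pending' || test.state === 'skipped') return 'skipped';
  if (test.state === 'failed') return 'failed';
  const hadFailure = test.attempts.some((a) => a.state === 'failed');
  return hadFailure ? 'flaky' : 'passed';
}

// Illustrative records.
const clean: CyTest = { title: ['cart', 'adds item'], state: 'passed', attempts: [{ state: 'passed' }] };
const retried: CyTest = {
  title: ['checkout', 'pays'],
  state: 'passed',
  attempts: [{ state: 'failed' }, { state: 'passed' }],
};
const broken: CyTest = {
  title: ['login', 'rejects bad password'],
  state: 'failed',
  attempts: [{ state: 'failed' }, { state: 'failed' }, { state: 'failed' }],
};
const verdicts = [clean, retried, broken].map(classify);
```

This still cannot tell a true flake from a failure that a retry happened to reseed away; it only guarantees that every "flaky" verdict is backed by a recorded failed attempt.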
3. Attach a trace artifact for root-cause triage
Detection tells you that a test flaked; the artifact tells you why. Playwright’s trace.zip is the richer artifact — it bundles DOM snapshots, network, and console into one viewer.
```typescript
// playwright.config.ts (artifact slice)
export default {
  use: {
    trace: 'on-first-retry', // capture only when a retry happens, to control storage cost
    video: 'retain-on-failure',
  },
};
```
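Cypress captures screenshots automatically on failure during cypress run. Video recording is off by default since Cypress 13, so set video: true in the config if you want it for triage. Opt into per-test screenshots for finer signal.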
```typescript
// cypress/e2e/checkout.cy.ts
afterEach(function () {
  if (this.currentTest?.state === 'failed') {
    cy.screenshot(`fail-${this.currentTest.title}`); // narrows triage to the failing step
  }
});
```
Trace network analysis is where mocking discipline pays off — flakes that trace back to unstubbed calls are best fixed with Cypress Network Interception Patterns or Playwright Route Mocking Strategies rather than more retries.
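To connect detection with triage, a script can list the trace files that belong to flaky tests so they can be uploaded or linked from the CI summary. This is a sketch over an assumed slice of the Playwright JSON report, where each result carries `attachments[]` and the trace is attached under the name `trace`; `flakyTraces` and the sample path are illustrative.

```typescript
// Assumed slice of the Playwright JSON report; only the fields read below are typed.
interface Attachment { name: string; contentType: string; path?: string }
interface TestResult { retry: number; status: string; attachments: Attachment[] }
interface TestEntry { status: string; results: TestResult[] }
interface Spec { title: string; tests: TestEntry[] }
interface TraceRef { title: string; retry: number; path: string }

// Hypothetical helper: collect trace paths for tests with a "flaky" verdict.
function flakyTraces(specs: Spec[]): TraceRef[] {
  const refs: TraceRef[] = [];
  for (const spec of specs) {
    for (const test of spec.tests) {
      if (test.status !== 'flaky') continue; // only retry-passes are of interest here
      for (const result of test.results) {
        for (const att of result.attachments) {
          if (att.name === 'trace' && att.path !== undefined) {
            refs.push({ title: spec.title, retry: result.retry, path: att.path });
          }
        }
      }
    }
  }
  return refs;
}

// Illustrative data: with trace: 'on-first-retry' the trace sits on the retry attempt.
const specs: Spec[] = [
  {
    title: 'checks out',
    tests: [{
      status: 'flaky',
      results: [
        { retry: 0, status: 'failed', attachments: [] },
        {
          retry: 1,
          status: 'passed',
          attachments: [{ name: 'trace', contentType: 'application/zip', path: 'test-results/checkout-retry1/trace.zip' }],
        },
      ],
    }],
  },
  { title: 'adds to cart', tests: [{ status: 'expected', results: [{ retry: 0, status: 'passed', attachments: [] }] }] },
];
const traces = flakyTraces(specs);
```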
4. Parallelize without poisoning the signal
Parallelism is a common source of false flakes, because tests that share state start to interfere once they run side by side. Cypress balances at the spec level through Cypress Cloud.
```bash
# CI: Cypress spec-level parallelism via Cloud
npx cypress run --record --parallel --ci-build-id "$GITHUB_RUN_ID"
# trade-off: load balancing needs Cloud; without it, manual spec splitting drifts
```
Playwright shards deterministically without a hosted service.
```bash
# CI matrix: shard 1 of 4, N workers per shard
npx playwright test --shard=1/4 --workers=2
# trade-off: cross-shard state collisions look like flakes; isolate fixtures per worker
```
Trade-off: More workers means faster builds but higher odds of race conditions in parallel runs leaking across tests. Cap worker counts until per-worker isolation is proven, then route confirmed flakes into auto-quarantine workflows.
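Per-worker isolation usually comes down to giving every worker its own namespace for shared resources such as database schemas or seed prefixes. The helper below is a hypothetical sketch: `isolatedNamespace` is not a Playwright API, and the shard index is assumed to be supplied by your CI matrix. Playwright does expose the worker index to test code through the TEST_WORKER_INDEX environment variable.

```typescript
// Values are passed in explicitly so the helper stays pure and easy to test.
interface IsolationEnv {
  shardIndex?: string;  // assumption: provided by the CI matrix, e.g. a SHARD_INDEX variable
  workerIndex?: string; // e.g. process.env.TEST_WORKER_INDEX inside a Playwright worker
}

// Hypothetical helper: derive a name that is unique per shard and per worker.
function isolatedNamespace(prefix: string, env: IsolationEnv): string {
  const shard = env.shardIndex ?? '0';
  const worker = env.workerIndex ?? '0';
  return `${prefix}_s${shard}_w${worker}`;
}

// Illustrative use: two workers on the same shard never share a schema name.
const schemaA = isolatedNamespace('orders', { shardIndex: '1', workerIndex: '0' });
const schemaB = isolatedNamespace('orders', { shardIndex: '1', workerIndex: '1' });
const localRun = isolatedNamespace('orders', {});
```

Seeding and tearing down data under that namespace keeps a collision between workers from showing up as a flake.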
Configuration reference
| Option | Runner | Accepted values | Default | Effect on flakiness |
|---|---|---|---|---|
| retries.runMode | Cypress | integer ≥ 0 | 0 | Higher reveals flakes but inflates CI minutes on failure |
| retries | Playwright | integer ≥ 0 | 0 | Enables the native flaky status when > 0 |
| trace | Playwright | off/on/on-first-retry/retain-on-failure | off | on-first-retry gives best signal-to-storage ratio |
| video | Cypress | true/false | false since Cypress 13 (true before) | Disabling saves storage but blinds triage |
| --workers | Playwright | integer / % | half of logical cores | More workers = faster but more cross-test interference |
| --parallel | Cypress | flag (needs Cloud) | off | Spec-level balancing; reduces wall time, needs --record |
Interpreting the data
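Each runner produces a different primary metric. From Playwright, count status === 'flaky' per test across a window of runs; a test crossing a ~1% flake rate warrants quarantine. A 50-run window resolves that rate only in 2% steps (one flaky run in 50 already reads as 2%), so treat 50 runs as the minimum window and widen it if you enforce a threshold that low. From Cypress, compute attempts.length > 1 && state === 'passed' as the flake proxy and divide the number of matching runs by total runs.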
Read the trend, not the snapshot. A single flaky verdict is noise; a test that flakes in 3 of the last 20 builds is a stable signal. Stream both counts into historical flakiness tracking analytics so the verdict survives across builds. Escalate to quarantine when a test’s rolling flake rate exceeds your SLO budget for two consecutive windows — escalating on a single spike just churns the suite.
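The escalation rule can be sketched as a pure function over a per-test history of flaky verdicts. The function names, the window size, and the budget value are illustrative assumptions; the rule itself (rolling rate above budget for two consecutive windows) follows the paragraph above.

```typescript
// One boolean per build, oldest first: true means the test was flaky in that build.
function flakeRate(history: boolean[]): number {
  if (history.length === 0) return 0;
  return history.filter((flaky) => flaky).length / history.length;
}

// Escalate only when the two most recent windows both exceed the budget.
function shouldQuarantine(history: boolean[], windowSize: number, budget: number): boolean {
  if (history.length < 2 * windowSize) return false; // not enough data for two windows
  const current = history.slice(-windowSize);
  const previous = history.slice(-2 * windowSize, -windowSize);
  return flakeRate(current) > budget && flakeRate(previous) > budget;
}

// Illustrative histories over 40 builds.
const steady: boolean[] = [];
const spike: boolean[] = [];
for (let i = 0; i < 40; i++) {
  steady.push(i % 7 === 0); // 3 flakes in each window of 20 builds
  spike.push(i >= 37);      // 3 flakes, all in the latest window
}

const quarantineSteady = shouldQuarantine(steady, 20, 0.1);
const quarantineSpike = shouldQuarantine(spike, 20, 0.1);
```

With a 10% budget the steadily flaky test is escalated, while the test with a single recent burst is held back until a second window confirms it.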
Common pitfalls & mitigations
- Treating Playwright retry-passes as plain green. They are not: preserve the flaky status into your report or you lose every detection signal. Mitigation: always parse status, never just exit codes.
- Inferring Cypress flakiness from exit code alone. A passed-after-retry spec exits 0, hiding the flake. Mitigation: parse attempts[] from the JSON/mochawesome report.
- Cross-shard state bleed read as flakiness. Shared DB rows or seeds across Playwright shards mimic nondeterminism. Mitigation: isolate fixtures per worker and per shard.
- Comparing the two runners on raw failure counts. The retry semantics differ, so raw counts are not comparable. Mitigation: normalize to a rolling flake-rate percentage.
- Storing every trace. trace: 'on' explodes artifact storage. Mitigation: use on-first-retry so only suspect runs persist.
Frequently Asked Questions
Q: Does Playwright’s built-in flaky status mean I do not need a detection pipeline?
A: It removes the classification step, but you still need to aggregate verdicts across runs, persist history, and trigger quarantine. The flag is the input to detection, not a replacement for it.
Q: Can I get a native flaky verdict in Cypress without Cypress Cloud?
A: Not a first-class one. You derive it by parsing attempts[] from the JSON or mochawesome reporter — a passed test with more than one attempt is your flake signal. Cypress Cloud adds hosted analytics on top of that same data.
Q: Which runner makes root-cause triage faster?
A: Playwright’s single trace.zip (DOM snapshots, network, console in one viewer) is generally faster to triage than Cypress’s separate video and screenshot artifacts, especially for network-driven flakes.