Cypress vs Playwright: Flaky Test

Both runners surface flakiness, but through different retry semantics, reporters, and artifact formats.

The core difference is where the “flaky” judgment lives. Playwright bakes a first-class flaky outcome into its JSON reporter when a test fails then passes on retry, so detection is mostly a matter of reading status fields. Cypress has no built-in flaky verdict — you infer it from attempt counts in the mocha-style results, or you lean on Cypress Cloud’s analytics. That single distinction ripples through every layer below.

Prerequisites

Component	Cypress	Playwright
Package	`cypress@13+`	`@playwright/test` 1.x
Node	18 LTS or 20 LTS	18 LTS or 20 LTS
CI runner	Linux container, 2 vCPU min	Linux container, 2 vCPU min
Reporter for detection	`cypress` JSON + `mochawesome`	built-in `json` / `blob`
Artifact for triage	video + screenshots on fail	`trace: 'on-first-retry'`
Parallelism source	Cypress Cloud (`--record --parallel`)	`--shard` + `workers`

Install the reporters you will parse before wiring detection. For Cypress, pair the native JSON output with mochawesome for attempt-level detail; for Playwright, the built-in json reporter already carries status, retry, and results[].

Step-by-step implementation

1. Enable retries so flakiness becomes observable

Retries are what convert a hard failure into a measurable flake signal. In Cypress, configure split run/open retry counts so local debugging stays strict.

// cypress.config.ts
import { defineConfig } from 'cypress';

export default defineConfig({
  retries: {
    runMode: 2,   // CI: a pass after a fail flags a flake (cost: up to 3x on failure)
    openMode: 0,  // local: never hide nondeterminism while debugging
  },
});

Playwright exposes the same idea but emits an explicit verdict.

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 2 : 0, // CI-only retries; status becomes "flaky" on retry-pass
  reporter: [['json', { outputFile: 'results.json' }], ['list']],
});

Trade-off: Retries shrink red builds but also mask genuine regressions if you treat retry-passes as green. Always record the retry count as a separate metric instead of discarding it. The same retry-tooling discipline drives automated flaky test detection tools.
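Recording the retry count as its own metric could look like the sketch below. It assumes each test entry in the Playwright JSON report carries a `status` and a `results[]` array with one element per attempt; the interfaces are trimmed local approximations, and `summarizeRetries` is a hypothetical helper name, not part of either runner.

```typescript
// Trimmed local types for the parts of a Playwright JSON report test entry used here (assumed shape).
interface AttemptEntry { retry: number; status: string }
interface TestEntry { status: string; results: AttemptEntry[] }

interface RetryMetrics { tests: number; flaky: number; retriesUsed: number }

// Hypothetical helper: count flaky verdicts and retries consumed, kept separate from pass/fail.
function summarizeRetries(tests: TestEntry[]): RetryMetrics {
  let flaky = 0;
  let retriesUsed = 0;
  for (const t of tests) {
    if (t.status === 'flaky') flaky += 1;
    // Each entry in results[] is one attempt; everything past the first is a retry.
    retriesUsed += Math.max(0, t.results.length - 1);
  }
  return { tests: tests.length, flaky, retriesUsed };
}

// Hand-written sample data, not real reporter output.
const sampleTests: TestEntry[] = [
  { status: 'expected', results: [{ retry: 0, status: 'passed' }] },
  { status: 'flaky', results: [{ retry: 0, status: 'failed' }, { retry: 1, status: 'passed' }] },
  {
    status: 'unexpected',
    results: [
      { retry: 0, status: 'failed' },
      { retry: 1, status: 'failed' },
      { retry: 2, status: 'failed' },
    ],
  },
];

const retryMetrics = summarizeRetries(sampleTests);
console.log(retryMetrics); // { tests: 3, flaky: 1, retriesUsed: 3 }
```

Publishing `retriesUsed` alongside the pass rate keeps a build that went green on its third attempt distinguishable from one that passed cleanly.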

2. Emit a machine-readable report

The detector consumes the report, not the console. Cypress JSON gives you attempts[] per test, which is your flake proxy.

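// scripts/detect-cypress-flakes.ts
// Assumes results/mocha.json exposes tests[] with `state` and `attempts[]`, as Cypress's
// Module API and after:spec results do; adjust the path and fields to your reporter's output.
import results from '../results/mocha.json'; // needs "resolveJsonModule" in tsconfig

const flaky = results.tests.filter(
  (t: any) => t.attempts && t.attempts.length > 1 && t.state === 'passed'
); // passed-after-retry == flaky in Cypress (no native flag)
console.log(`flaky tests: ${flaky.length}`);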
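The filter above depends on the shape of the report it reads. As a hedged sketch, the same rule can be written as a typed, testable function against the per-spec results object that Cypress hands to `after:spec` handlers, which in Cypress 13 lists `tests[]` with a final `state` and an `attempts[]` array. The interfaces here are trimmed local approximations of that shape, and `findCypressFlakes` is a hypothetical helper name.

```typescript
// Trimmed, local approximation of Cypress's per-spec results (assumed shape).
interface AttemptResult { state: string }
interface TestResult {
  title: string[];           // suite titles followed by the test title
  state: string;             // final state, e.g. "passed" or "failed"
  attempts: AttemptResult[]; // one entry per attempt, retries included
}
interface SpecResults { tests: TestResult[] }

// Hypothetical helper: a test that ended "passed" after more than one attempt is treated as flaky.
function findCypressFlakes(results: SpecResults): string[] {
  return results.tests
    .filter((t) => t.state === 'passed' && t.attempts.length > 1)
    .map((t) => t.title.join(' > '));
}

// Hand-written sample data, not real Cypress output.
const sampleSpec: SpecResults = {
  tests: [
    {
      title: ['checkout', 'pays'],
      state: 'passed',
      attempts: [{ state: 'failed' }, { state: 'passed' }],
    },
    { title: ['checkout', 'loads'], state: 'passed', attempts: [{ state: 'passed' }] },
    {
      title: ['checkout', 'refunds'],
      state: 'failed',
      attempts: [{ state: 'failed' }, { state: 'failed' }, { state: 'failed' }],
    },
  ],
};

const cypressFlakes = findCypressFlakes(sampleSpec);
console.log(cypressFlakes); // [ 'checkout > pays' ]
```

A test that fails on every attempt stays a plain failure here; only the pass-after-fail case counts toward the flake metric.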

Playwright hands you the verdict directly.

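// scripts/detect-pw-flakes.ts
import report from '../results.json'; // needs "resolveJsonModule" in tsconfig

// describe() blocks nest as suites inside suites, so walk the tree recursively
const specsOf = (s: any): any[] => [...(s.specs ?? []), ...(s.suites ?? []).flatMap(specsOf)];

const flaky = report.suites
  .flatMap(specsOf)
  .filter((spec: any) => spec.tests.some((t: any) => t.status === 'flaky'));
console.log(`flaky tests: ${flaky.length}`);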

Trade-off: Cypress’s inferred signal flags every pass-after-retry, so a test that fails deterministically until its data is reseeded between attempts gets misclassified as flaky; Playwright’s flag is precise but only fires when retries are enabled. Choose retry counts deliberately.

3. Attach a trace artifact for root-cause triage

Detection tells you that a test flaked; the artifact tells you why. Playwright’s trace.zip is the richer artifact — it bundles DOM snapshots, network, and console into one viewer.

// playwright.config.ts (artifact slice)
export default {
  use: {
    trace: 'on-first-retry', // capture only when a retry happens, to control storage cost
    video: 'retain-on-failure',
  },
};
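To connect a flaky verdict to its trace file, one option is to read the attachment paths out of the same JSON report. The sketch below assumes each attempt in `results[]` carries an `attachments[]` array and that the trace is attached under the name `trace` with a file `path`; those field values, the trimmed interfaces, and the helper name `flakyTracePaths` are assumptions to check against your own report.

```typescript
// Trimmed local types for a Playwright JSON report (assumed shape).
interface Attachment { name: string; contentType: string; path?: string }
interface Attempt { retry: number; status: string; attachments: Attachment[] }
interface ReportTest { status: string; results: Attempt[] }
interface Spec { title: string; tests: ReportTest[] }
interface Suite { title: string; specs: Spec[]; suites?: Suite[] }

// Suites nest (files contain describe blocks), so gather specs recursively.
function collectSpecs(suite: Suite): Spec[] {
  return [...suite.specs, ...(suite.suites ?? []).flatMap(collectSpecs)];
}

// Hypothetical helper: map each flaky spec title to the trace files recorded for it.
function flakyTracePaths(suites: Suite[]): Record<string, string[]> {
  const out: Record<string, string[]> = {};
  for (const spec of suites.flatMap(collectSpecs)) {
    for (const t of spec.tests) {
      if (t.status !== 'flaky') continue;
      const paths = t.results
        .flatMap((r) => r.attachments)
        .filter((a) => a.name === 'trace' && a.path !== undefined)
        .map((a) => a.path as string);
      if (paths.length > 0) out[spec.title] = paths;
    }
  }
  return out;
}

// Hand-written sample data with a hypothetical output path.
const sampleSuites: Suite[] = [
  {
    title: 'checkout.spec.ts',
    specs: [],
    suites: [
      {
        title: 'checkout',
        specs: [
          {
            title: 'pays with card',
            tests: [
              {
                status: 'flaky',
                results: [
                  { retry: 0, status: 'failed', attachments: [] },
                  {
                    retry: 1,
                    status: 'passed',
                    attachments: [
                      { name: 'trace', contentType: 'application/zip', path: 'test-results/pays-retry1/trace.zip' },
                    ],
                  },
                ],
              },
            ],
          },
        ],
      },
    ],
  },
];

const tracePaths = flakyTracePaths(sampleSuites);
console.log(JSON.stringify(tracePaths));
```

With `on-first-retry`, the trace belongs to the retry attempt rather than the original failure, which is why the helper scans every attempt instead of only the first.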

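Cypress captures screenshots automatically on failure during `cypress run`; video recording is off by default since Cypress 13 and has to be enabled with `video: true`. Opt into explicitly named per-test screenshots for finer signal.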

// cypress/e2e/checkout.cy.ts
afterEach(function () {
  if (this.currentTest?.state === 'failed') {
    cy.screenshot(`fail-${this.currentTest.title}`); // narrows triage to the failing step
  }
});

Trace network analysis is where mocking discipline pays off — flakes that trace back to unstubbed calls are best fixed with Cypress Network Interception Patterns or Playwright Route Mocking Strategies rather than more retries.

4. Parallelize without poisoning the signal

Parallelism is the single biggest source of false flakes from shared state. Cypress balances at the spec level through Cypress Cloud.

# CI: Cypress spec-level parallelism via Cloud
npx cypress run --record --parallel --ci-build-id "$GITHUB_RUN_ID"
# trade-off: load balancing needs Cloud; without it, manual spec splitting drifts

Playwright shards deterministically without a hosted service.

# CI matrix: shard 1 of 4, N workers per shard
npx playwright test --shard=1/4 --workers=2
# trade-off: cross-shard state collisions look like flakes; isolate fixtures per worker

Trade-off: More workers means faster builds but higher odds of race conditions in parallel runs leaking across tests. Cap worker counts until per-worker isolation is proven, then route confirmed flakes into auto-quarantine workflows.
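Per-worker isolation often comes down to giving every worker its own namespace for shared resources. The sketch below derives one from `TEST_PARALLEL_INDEX`, which Playwright sets in worker processes, plus a `SHARD_INDEX` variable that is purely an assumption here: Playwright does not export it, so your CI matrix would have to. The helper name and the `e2e_` prefix are hypothetical.

```typescript
// Hypothetical helper: build a namespace (DB schema, key prefix, temp dir) unique per shard and worker slot.
function isolationNamespace(env: Record<string, string | undefined>): string {
  const shard = env['SHARD_INDEX'] ?? '0';          // assumption: exported by the CI matrix, not by Playwright
  const worker = env['TEST_PARALLEL_INDEX'] ?? '0'; // set by Playwright inside worker processes
  return `e2e_s${shard}_w${worker}`;
}

// Two workers on the same shard, and the same worker slot on another shard, never share a namespace.
const nsA = isolationNamespace({ SHARD_INDEX: '1', TEST_PARALLEL_INDEX: '0' });
const nsB = isolationNamespace({ SHARD_INDEX: '1', TEST_PARALLEL_INDEX: '3' });
const nsC = isolationNamespace({ SHARD_INDEX: '2', TEST_PARALLEL_INDEX: '0' });
console.log(nsA, nsB, nsC); // e2e_s1_w0 e2e_s1_w3 e2e_s2_w0
```

In a real suite the function would be called with `process.env` from a worker-scoped fixture, so each worker seeds and tears down only its own data.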

Configuration reference

Option	Runner	Accepted values	Default	Effect on flakiness
`retries.runMode`	Cypress	integer ≥ 0	`0`	Higher reveals flakes but inflates CI minutes on failure
`retries`	Playwright	integer ≥ 0	`0`	Enables the native `flaky` status when > 0
`trace`	Playwright	`off`/`on`/`on-first-retry`/`on-all-retries`/`retain-on-failure`/`retain-on-first-failure`	`off`	`on-first-retry` gives best signal-to-storage ratio
`video`	Cypress	`true`/`false`	`false` (since Cypress 13)	Leaving it off saves storage but blinds triage
`--workers`	Playwright	integer / `%`	half of logical cores	More workers = faster but more cross-test interference
`--parallel`	Cypress	flag (needs Cloud)	off	Spec-level balancing; reduces wall time, needs `--record`

Interpreting the data

Each runner produces a different primary metric. From Playwright, count status === 'flaky' per test across a window of runs; a test whose flake rate crosses roughly 1% warrants quarantine, but the window has to be long enough to resolve that rate, since one flake in 50 runs already reads as 2%. From Cypress, use attempts.length > 1 && state === 'passed' as the flake proxy, count the runs where it holds, and divide by total runs.

Read the trend, not the snapshot. A single flaky verdict is noise; a test that flakes in 3 of the last 20 builds is a stable signal. Stream both counts into historical flakiness tracking analytics so the verdict survives across builds. Escalate to quarantine when a test’s rolling flake rate exceeds your SLO budget for two consecutive windows — escalating on a single spike just churns the suite.
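The two-consecutive-windows rule can be sketched as a small pure function over a test's verdict history, oldest first. The window size of 20 and the 10% budget below are illustrative assumptions, not recommended values, and `shouldQuarantine` is a hypothetical helper name.

```typescript
type Verdict = 'passed' | 'flaky' | 'failed';

// Share of runs in the window that ended with a flaky verdict.
function flakeRate(window: Verdict[]): number {
  if (window.length === 0) return 0;
  return window.filter((v) => v === 'flaky').length / window.length;
}

// Hypothetical policy helper: quarantine only when the two most recent full windows both exceed the budget.
function shouldQuarantine(history: Verdict[], windowSize: number, budget: number): boolean {
  if (history.length < 2 * windowSize) return false; // not enough data for two windows
  const latest = history.slice(-windowSize);
  const previous = history.slice(-2 * windowSize, -windowSize);
  return flakeRate(previous) > budget && flakeRate(latest) > budget;
}

// Build a window of `size` runs in which the first `flaky` runs flaked (sample data only).
function makeWindow(flaky: number, size: number): Verdict[] {
  const out: Verdict[] = [];
  for (let i = 0; i < size; i += 1) out.push(i < flaky ? 'flaky' : 'passed');
  return out;
}

const sustained = [...makeWindow(3, 20), ...makeWindow(3, 20)]; // 15% in both windows
const singleSpike = [...makeWindow(0, 20), ...makeWindow(3, 20)]; // 0% then 15%

console.log(shouldQuarantine(sustained, 20, 0.1));   // true
console.log(shouldQuarantine(singleSpike, 20, 0.1)); // false
```

Because the input is just a list of verdicts, the same function works for Playwright's native `flaky` status and for the Cypress pass-after-retry proxy, which keeps the two runners comparable on rate rather than raw counts.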

Common pitfalls & mitigations

Treating Playwright retry-passes as plain green. They are not — preserve the flaky status into your report or you lose every detection signal. Mitigation: always parse status, never just exit codes.
Inferring Cypress flakiness from exit code alone. A passed-after-retry spec exits 0, hiding the flake. Mitigation: parse attempts[] from the JSON/mochawesome report.
Cross-shard state bleed read as flakiness. Shared DB rows or seeds across Playwright shards mimic nondeterminism. Mitigation: isolate fixtures per worker and per shard.
Comparing the two runners on raw failure counts. The retry semantics differ, so raw counts are not comparable. Mitigation: normalize to a rolling flake-rate percentage.
Storing every trace. trace: 'on' explodes artifact storage. Mitigation: use on-first-retry so only suspect runs persist.

Frequently Asked Questions

Q: Does Playwright’s built-in flaky status mean I do not need a detection pipeline? A: It removes the classification step, but you still need to aggregate verdicts across runs, persist history, and trigger quarantine. The flag is the input to detection, not a replacement for it.

Q: Can I get a native flaky verdict in Cypress without Cypress Cloud? A: Not a first-class one. You derive it by parsing attempts[] from the JSON or mochawesome reporter — a passed test with more than one attempt is your flake signal. Cypress Cloud adds hosted analytics on top of that same data.

Q: Which runner makes root-cause triage faster? A: Playwright’s single trace.zip (DOM snapshots, network, console in one viewer) is generally faster to triage than Cypress’s separate video and screenshot artifacts, especially for network-driven flakes.