Cypress vs Playwright: Flaky Test Detection Architecture

Choosing a runner shapes how flakiness surfaces, gets attributed, and gets quarantined, so this comparison sits squarely inside Flaky Test Detection & Quarantine Engineering. Cypress and Playwright both retry, both emit reporters, and both capture artifacts — but their retry models, trace formats, and parallelization strategies differ enough that a detection pipeline built for one rarely transfers cleanly to the other. This guide maps both architectures side by side so you can pick reporters and artifacts that feed your detection signal instead of fighting it.

Figure: Cypress vs Playwright flakiness detection architecture. Side-by-side comparison of how Cypress and Playwright surface flakiness through retries, reporters, trace artifacts, and parallelization.

Layer | Cypress | Playwright
Retry model | test-level retries (run mode) | retries + flaky status flag
Reporter | mocha JSON / Cloud | JSON / JUnit / blob
Trace artifact | video + screenshots | trace.zip (DOM + net)
Parallelization | spec-level (Cloud balancing) | shards + workers
Signal | retry-then-pass = flaky | status === "flaky" emitted
Both runners surface flakiness, but through different retry semantics, reporters, and artifact formats.

The core difference is where the “flaky” judgment lives. Playwright bakes a first-class flaky outcome into its JSON reporter when a test fails then passes on retry, so detection is mostly a matter of reading status fields. Cypress has no built-in flaky verdict — you infer it from attempt counts in the mocha-style results, or you lean on Cypress Cloud’s analytics. That single distinction ripples through every layer below.

Prerequisites

Component | Cypress | Playwright
Package | cypress@13+ | @playwright/test@…+
Node | 18 LTS or 20 LTS | 18 LTS or 20 LTS
CI runner | Linux container, 2 vCPU min | Linux container, 2 vCPU min
Reporter for detection | cypress JSON + mochawesome | built-in json / blob
Artifact for triage | screenshots on fail + video (video: true) | trace: 'on-first-retry'
Parallelism source | Cypress Cloud (--record --parallel) | --shard + workers

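Install the reporters you will parse before wiring detection. For Cypress, pair a JSON results file with mochawesome for readable reports, and confirm that the file you parse actually carries per-test attempts[] (Cypress exposes them in its run results, for example through the Module API or the after:spec event). For Playwright, the built-in json reporter already carries a per-test status plus results[], with a retry index on each result.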

Step-by-step implementation

1. Enable retries so flakiness becomes observable

Retries are what convert a hard failure into a measurable flake signal. In Cypress, configure split run/open retry counts so local debugging stays strict.

// cypress.config.ts
import { defineConfig } from 'cypress';

export default defineConfig({
  retries: {
    runMode: 2,   // CI: a pass after a fail flags a flake (cost: up to 3x on failure)
    openMode: 0,  // local: never hide nondeterminism while debugging
  },
});

Playwright exposes the same idea but emits an explicit verdict.

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 2 : 0, // CI-only retries; status becomes "flaky" on retry-pass
  reporter: [['json', { outputFile: 'results.json' }], ['list']],
});

Trade-off: Retries shrink red builds but also mask genuine regressions if you treat retry-passes as green. Always record the retry count as a separate metric instead of discarding it. The same retry-tooling discipline drives automated flaky test detection tools.
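One way to keep the retry count as its own metric is to derive it from the per-attempt entries in the Playwright JSON report. The sketch below assumes each test entry carries results[] with a zero-based retry index and a status string; the RetryMetric record and the retryMetric helper are hypothetical names, not part of either runner.

```typescript
// Assumed slice of a Playwright JSON-report test entry; verify against your own report.
type PwAttempt = { retry: number; status: string };
type PwTestSlice = { status: string; results: PwAttempt[] };

// Hypothetical metric record for illustration.
type RetryMetric = { retriesUsed: number; finalOutcome: string; maskedFailure: boolean };

function retryMetric(test: PwTestSlice): RetryMetric {
  // results[] holds one entry per attempt; `retry` is 0 for the first attempt.
  const retriesUsed = test.results.reduce((max, r) => Math.max(max, r.retry), 0);
  const last = test.results[test.results.length - 1];
  const finalOutcome = last ? last.status : 'unknown';
  // A green final attempt that needed retries is what a plain pass/fail gate hides.
  const maskedFailure = retriesUsed > 0 && finalOutcome === 'passed';
  return { retriesUsed, finalOutcome, maskedFailure };
}
```

Emitting retriesUsed next to the build result keeps the retry cost visible even when the build is green.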

2. Emit a machine-readable report

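The detector consumes the report, not the console. Cypress exposes attempts[] per test in its run results, and that array is your flake proxy. Mocha-style reporters may record only the final state, so confirm the file you parse carries attempts[]; the script below assumes it does.

// scripts/detect-cypress-flakes.ts
// Assumes results/mocha.json holds per-test entries with `state` and `attempts[]`
// (the shape Cypress exposes in its run results). Needs "resolveJsonModule": true.
import results from '../results/mocha.json';

type CyTest = { state: string; attempts?: unknown[] };

const flaky = (results.tests as CyTest[]).filter(
  (t) => (t.attempts?.length ?? 0) > 1 && t.state === 'passed'
); // passed-after-retry == flaky in Cypress (no native flag)
console.log(`flaky tests: ${flaky.length}`);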

Playwright hands you the verdict directly.

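// scripts/detect-pw-flakes.ts
import report from '../results.json'; // needs "resolveJsonModule": true in tsconfig

// describe() blocks nest under suite.suites, so walk the tree rather than
// reading only the top-level (per-file) specs.
const allSpecs = (suite: any): any[] => [
  ...(suite.specs ?? []),
  ...(suite.suites ?? []).flatMap(allSpecs),
];

const flaky = report.suites
  .flatMap(allSpecs)
  .filter((spec: any) => spec.tests.some((t: any) => t.status === 'flaky'));
console.log(`flaky tests: ${flaky.length}`);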

Trade-off: Cypress’s inferred signal can misclassify a test as flaky when it fails deterministically on the first attempt and passes only because the retry starts from reseeded state; Playwright’s flag is precise but only fires when retries are enabled. Choose retry counts deliberately.
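Because one runner infers the verdict and the other reports it, a detector that serves both usually needs a small normalization layer. The sketch below maps each runner's per-test entry onto a shared verdict. The Verdict type and both helper names are hypothetical, and the input shapes are assumptions: state plus attempts[] for Cypress, and the expected / unexpected / flaky / skipped status values for Playwright.

```typescript
type Verdict = 'passed' | 'flaky' | 'failed' | 'skipped';

// Assumed Cypress per-test shape: final state plus one entry per attempt.
type CyTestResult = { state: string; attempts: { state: string }[] };

function cypressVerdict(t: CyTestResult): Verdict {
  // No native flag: a pass that needed more than one attempt is treated as flaky.
  if (t.state === 'passed') return t.attempts.length > 1 ? 'flaky' : 'passed';
  if (t.state === 'failed') return 'failed';
  return 'skipped'; // pending or skipped tests
}

// Assumed Playwright per-test status values from the JSON report.
type PwTestEntry = { status: 'expected' | 'unexpected' | 'flaky' | 'skipped' };

function playwrightVerdict(t: PwTestEntry): Verdict {
  switch (t.status) {
    case 'expected':
      return 'passed'; // note: also covers tests that were expected to fail and did
    case 'unexpected':
      return 'failed';
    case 'flaky':
      return 'flaky';
    default:
      return 'skipped';
  }
}
```

Downstream aggregation can then work on Verdict alone and stay unaware of which runner produced the row.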

3. Attach a trace artifact for root-cause triage

Detection tells you that a test flaked; the artifact tells you why. Playwright’s trace.zip is the richer artifact — it bundles DOM snapshots, network, and console into one viewer.

// playwright.config.ts (artifact slice)
export default {
  use: {
    trace: 'on-first-retry', // capture only when a retry happens, to control storage cost
    video: 'retain-on-failure',
  },
};

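Cypress captures screenshots automatically on failure in run mode. Video recording is off by default since Cypress 13, so set video: true in the config if you want it for triage. Opt into per-test screenshots for finer signal.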

// cypress/e2e/checkout.cy.ts
afterEach(function () {
  if (this.currentTest?.state === 'failed') {
    cy.screenshot(`fail-${this.currentTest.title}`); // narrows triage to the failing step
  }
});

Trace network analysis is where mocking discipline pays off — flakes that trace back to unstubbed calls are best fixed with Cypress Network Interception Patterns or Playwright Route Mocking Strategies rather than more retries.

4. Parallelize without poisoning the signal

Parallelism is the single biggest source of false flakes from shared state. Cypress balances at the spec level through Cypress Cloud.

# CI: Cypress spec-level parallelism via Cloud
npx cypress run --record --parallel --ci-build-id "$GITHUB_RUN_ID"
# trade-off: load balancing needs Cloud; without it, manual spec splitting drifts

Playwright shards deterministically without a hosted service.

# CI matrix: shard 1 of 4, 2 workers per shard
npx playwright test --shard=1/4 --workers=2
# trade-off: cross-shard state collisions look like flakes; isolate fixtures per worker

Trade-off: More workers means faster builds but higher odds of race conditions in parallel runs leaking across tests. Cap worker counts until per-worker isolation is proven, then route confirmed flakes into auto-quarantine workflows.
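Per-worker isolation often comes down to giving every worker in every shard its own namespace for seeded data. The helper below is a hypothetical sketch of that idea. In a Playwright fixture the inputs could plausibly come from testInfo.workerIndex and testInfo.config.shard, but treat that wiring as an assumption to verify against your Playwright version.

```typescript
// Hypothetical helper: builds a namespace unique to one worker inside one shard.
function isolationKey(shardCurrent: number, shardTotal: number, workerIndex: number): string {
  return `s${shardCurrent}of${shardTotal}-w${workerIndex}`;
}

// Example: prefix seeded records so two workers never touch the same rows.
function seededUserEmail(key: string): string {
  return `e2e-${key}@example.test`;
}
```

Using the key as a prefix for database schemas, seeded users, or temp directories keeps a collision between workers from showing up as a flake.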

Configuration reference

Option | Runner | Accepted values | Default | Effect on flakiness
retries.runMode | Cypress | integer ≥ 0 | 0 | Higher reveals flakes but inflates CI minutes on failure
retries | Playwright | integer ≥ 0 | 0 | Enables the native flaky status when > 0
trace | Playwright | off / on / on-first-retry / retain-on-failure | off | on-first-retry gives best signal-to-storage ratio
video | Cypress | true / false | false (since Cypress 13) | Leaving it off saves storage but blinds triage
--workers | Playwright | integer / % | half of logical CPU cores | More workers = faster but more cross-test interference
--parallel | Cypress | flag (needs Cloud) | off | Spec-level balancing; reduces wall time, needs --record

Interpreting the data

Each runner produces a different primary metric. From Playwright, count status === 'flaky' per test across a window of runs. A test crossing a ~1% flake rate warrants quarantine, but a 50-run window cannot resolve that threshold (one flake in 50 runs is already 2%), so use a window of at least 100 runs for a 1% cutoff. From Cypress, compute attempts.length > 1 && state === 'passed' as the flake proxy and divide by total runs.
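The per-test rate is a simple ratio once verdicts from either runner are reduced to one row per test per run. The sketch below is one way to compute it; VerdictRow, FlakeStat, and flakeRates are hypothetical names, and the row shape is an assumption about how you store history.

```typescript
// Hypothetical history row: one entry per test per run.
type VerdictRow = { testId: string; flaky: boolean };
type FlakeStat = { runs: number; flaky: number; ratePct: number };

function flakeRates(rows: VerdictRow[]): Record<string, FlakeStat> {
  const stats: Record<string, FlakeStat> = {};
  rows.forEach((row) => {
    const s = stats[row.testId] ?? { runs: 0, flaky: 0, ratePct: 0 };
    s.runs += 1;
    if (row.flaky) s.flaky += 1;
    // Percentage of runs in which the test was classified as flaky.
    s.ratePct = (s.flaky * 100) / s.runs;
    stats[row.testId] = s;
  });
  return stats;
}
```

The window size sets the resolution of the metric: with 50 rows for a test, a single flaky run already reads as 2%.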

Read the trend, not the snapshot. A single flaky verdict is noise; a test that flakes in 3 of the last 20 builds is a stable signal. Stream both counts into historical flakiness tracking analytics so the verdict survives across builds. Escalate to quarantine when a test’s rolling flake rate exceeds your SLO budget for two consecutive windows — escalating on a single spike just churns the suite.
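The two-consecutive-windows rule can be sketched as a pure function over a test's chronological history. The names windowRate and shouldQuarantine, the boolean-per-build input, and the budget expressed as a fraction are all assumptions made for illustration.

```typescript
// Fraction of builds in the window where the test was flaky.
function windowRate(flakyByBuild: boolean[]): number {
  if (flakyByBuild.length === 0) return 0;
  return flakyByBuild.filter(Boolean).length / flakyByBuild.length;
}

// flakyByBuild is ordered oldest to newest; budget is a fraction such as 0.1 for 10%.
function shouldQuarantine(flakyByBuild: boolean[], windowSize: number, budget: number): boolean {
  // Without two full windows of history there is nothing to compare yet.
  if (windowSize <= 0 || flakyByBuild.length < 2 * windowSize) return false;
  const latest = flakyByBuild.slice(-windowSize);
  const previous = flakyByBuild.slice(-2 * windowSize, -windowSize);
  // Escalate only when both adjacent windows exceed the budget, so one spike is ignored.
  return windowRate(latest) > budget && windowRate(previous) > budget;
}
```

A test that flakes in 3 of 20 builds in each of the last two windows crosses a 10% budget twice and escalates; a test with a quiet previous window and one noisy recent window does not.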

Common pitfalls & mitigations

  • Treating Playwright retry-passes as plain green. They are not — preserve the flaky status into your report or you lose every detection signal. Mitigation: always parse status, never just exit codes.
  • Inferring Cypress flakiness from exit code alone. A passed-after-retry spec exits 0, hiding the flake. Mitigation: parse attempts[] from the JSON results, after confirming that your reporter output includes them.
  • Cross-shard state bleed read as flakiness. Shared DB rows or seeds across Playwright shards mimic nondeterminism. Mitigation: isolate fixtures per worker and per shard.
  • Comparing the two runners on raw failure counts. The retry semantics differ, so raw counts are not comparable. Mitigation: normalize to a rolling flake-rate percentage.
  • Storing every trace. trace: 'on' explodes artifact storage. Mitigation: use on-first-retry so only suspect runs persist.
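The first two pitfalls share one cause: the exit code carries a single bit, while detection needs two. A minimal sketch of a gate that keeps them apart follows; SuiteCounts, GateDecision, and gateDecision are hypothetical names, and the counts are assumed to come from whichever parsed report you use.

```typescript
type SuiteCounts = { failed: number; flaky: number };
type GateDecision = { exitCode: 0 | 1; flakySignal: boolean };

function gateDecision(counts: SuiteCounts): GateDecision {
  return {
    exitCode: counts.failed > 0 ? 1 : 0, // the build result tracks hard failures only
    flakySignal: counts.flaky > 0, // reported alongside the exit code, never folded into it
  };
}
```

A run with zero failures and two flaky tests still exits 0 here, but flakySignal stays true so the detector has something to record.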

Frequently Asked Questions

Q: Does Playwright’s built-in flaky status mean I do not need a detection pipeline? A: It removes the classification step, but you still need to aggregate verdicts across runs, persist history, and trigger quarantine. The flag is the input to detection, not a replacement for it.

Q: Can I get a native flaky verdict in Cypress without Cypress Cloud? A: Not a first-class one. You derive it by parsing attempts[] from the run results you persist as JSON: a passed test with more than one attempt is your flake signal. Cypress Cloud adds hosted analytics on top of that same data.

Q: Which runner makes root-cause triage faster? A: Playwright’s single trace.zip (DOM snapshots, network, console in one viewer) is generally faster to triage than Cypress’s separate video and screenshot artifacts, especially for network-driven flakes.