Retrying Idempotent Requests Without

Retry only idempotent, retryable failures with backoff and a metric; everything else fails fast, and a rising retry rate is surfaced as flakiness rather than swallowed.

Root cause

Two completely different things are called “retry,” and conflating them is how flakiness hides. A request retry lives in the application’s HTTP client: when a fetch to a flaky upstream fails, the client tries again. A test retry lives in the test runner: when the whole test fails, the runner re-runs it. The first is a legitimate resilience feature your users benefit from; the second is a last-resort flakiness suppressant. The failure mode is using either one to hide a deterministic bug.

App-level request retries become dangerous in three ways. First, retrying a non-idempotent request (a POST that creates an order) can double-charge or duplicate data — the retry is not safe even though it “succeeds.” Second, retrying on a 500 masks a real server error that your test should have caught. Third, an unobserved retry makes a degraded dependency look healthy: the request fails three times, succeeds on the fourth, the test goes green, and nobody learns the upstream is on fire. The fix is not to ban retries but to constrain them — restrict to idempotent methods and retryable status codes, cap the attempts, add backoff, and emit a metric on every retry so the suppression is measurable. Then you can tell legitimate resilience from masked flakiness, because the masking has a number attached to it.

Step-by-step fix

1. Retry only idempotent methods on retryable failures

Gate the retry on both the HTTP method and the failure class. A GET timing out is retryable; a POST returning 400 is not.

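// src/http/retry.ts: narrow, observable request retry
import { backoff } from './backoff';
import { onRetry } from './observe';

const RETRYABLE_STATUS = new Set([502, 503, 504]);
const IDEMPOTENT = new Set(['GET', 'HEAD', 'PUT', 'DELETE']);

// `max` counts retries, so the client makes at most max + 1 attempts.
export async function fetchWithRetry(url: string, init: RequestInit = {}, max = 3): Promise<Response> {
  const method = (init.method ?? 'GET').toUpperCase();
  const idempotent = IDEMPOTENT.has(method);
  for (let attempt = 0; ; attempt++) {
    let res: Response | undefined;
    try {
      res = await fetch(url, init);
    } catch (err) {
      // fetch rejects on network-level failures (DNS, connection reset, socket
      // timeout). Retry those for idempotent methods only; an already-aborted
      // signal cannot be reused, so rethrow in that case.
      if (!idempotent || attempt >= max || init.signal?.aborted) throw err;
    }
    if (res) {
      // Only retry safe methods on transient infra errors; a 500 or a 4xx is a
      // real signal and must surface, not be swallowed by another attempt.
      const retryable = idempotent && RETRYABLE_STATUS.has(res.status);
      if (res.ok || !retryable || attempt >= max) return res;
    }
    onRetry(url, attempt); // emit a metric; see step 3
    await backoff(attempt);
  }
}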

2. Add capped exponential backoff with jitter

Immediate retries hammer a struggling upstream. Back off, and add jitter so parallel clients do not synchronize their retries into a thundering herd.

// src/http/backoff.ts — exponential backoff with jitter, hard ceiling
export function backoff(attempt: number): Promise<void> {
  const base = Math.min(1000 * 2 ** attempt, 8000); // cap at 8s, not unbounded
  // Jitter prevents synchronized retries from N parallel test workers from
  // colliding into a self-inflicted load spike against the dependency.
  const delay = base / 2 + Math.random() * (base / 2);
  return new Promise(r => setTimeout(r, delay));
}
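
To make the schedule concrete, the sketch below computes the window of delays that the `backoff` function above can produce for each attempt. `backoffBounds` and `schedule` are hypothetical names introduced only for illustration; the helper mirrors the arithmetic in `backoff` without actually sleeping.

```typescript
// Hypothetical helper: mirrors the arithmetic of backoff() without sleeping.
function backoffBounds(attempt: number): { minMs: number; maxMs: number } {
  // Same base as backoff(): 1s, doubling per attempt, capped at 8s.
  const base = Math.min(1000 * 2 ** attempt, 8000);
  // backoff() picks a delay in [base / 2, base): half fixed, half random.
  return { minMs: base / 2, maxMs: base };
}

// Delay windows for the first five zero-based attempts.
const schedule = [0, 1, 2, 3, 4].map(backoffBounds);
// attempt 0: 500 to 1000 ms, attempt 1: 1000 to 2000 ms,
// attempt 2: 2000 to 4000 ms, attempt 3 and later: 4000 to 8000 ms (the cap)
```

With the default of three retries in `fetchWithRetry`, only attempts 0 through 2 ever sleep, so the backoff adds between 3.5 and 7 seconds in the worst case. Keep that total below your test timeout.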

3. Emit a metric on every retry so masking is visible

This is the line that separates resilience from suppression. Count every retry and tag it, so a climbing retry rate shows up where you track flaky test detection rather than vanishing into a green checkmark.

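// src/http/observe.ts: make every retry measurable
// Assumption: `metrics` is your project's StatsD-style client. The import
// path below is hypothetical; point it at wherever that client is configured.
import { metrics } from '../metrics';

export function onRetry(url: string, attempt: number) {
  // A spike in this counter means an upstream is degrading. Surface it as a
  // reliability signal instead of letting the retry quietly hide the failure.
  metrics.increment('http.request.retry', {
    tags: [`endpoint:${new URL(url).pathname}`, `attempt:${attempt}`],
  });
}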
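
For tests, it can help to swap the real client for a recorder you can assert on. The following is a minimal sketch, assuming only the `increment(name, { tags })` call shape used above; `InMemoryMetrics`, `total`, and `testMetrics` are hypothetical names, not part of any metrics library.

```typescript
// Hypothetical in-memory stand-in for the `metrics` client used by onRetry().
type IncrementOptions = { tags?: string[] };

class InMemoryMetrics {
  private counts = new Map<string, number>();

  // Same call shape as metrics.increment(name, { tags }) in observe.ts.
  increment(name: string, options: IncrementOptions = {}): void {
    const key = [name].concat(options.tags ?? []).join('|');
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  // Total increments for a metric name, summed across all tag combinations.
  total(name: string): number {
    let sum = 0;
    this.counts.forEach((value, key) => {
      if (key === name || key.indexOf(name + '|') === 0) sum += value;
    });
    return sum;
  }
}

// Two retries against the same endpoint, recorded with different attempt tags.
const testMetrics = new InMemoryMetrics();
testMetrics.increment('http.request.retry', { tags: ['endpoint:/orders', 'attempt:0'] });
testMetrics.increment('http.request.retry', { tags: ['endpoint:/orders', 'attempt:1'] });
```

A resilience test can then assert on `testMetrics.total('http.request.retry')` instead of inspecting logs.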

4. Keep request retries and test retries separate

Do not let the test runner’s retry compensate for an app that lacks retries, and do not let app retries hide a deterministic test failure. Configure runner retries narrowly and assert on the retry metric in tests that exercise resilience.

// playwright.config.ts: runner retries are a safety net, not a fix
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // One CI retry catches true infra flakes; more than that hides real bugs.
  // The app's own request retry is tested explicitly, not via runner retries.
  retries: process.env.CI ? 1 : 0,
});

A test that mocks three 503s followed by a 200 should assert the client recovered and that exactly three retries were recorded — proving the resilience works without letting it mask a permanent failure.
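
One way to sketch that test without a network or a test framework is to inject the dependencies. The code below is an illustration under assumptions, not the article's `fetchWithRetry` itself: `makeRetryingFetch`, `scriptedFetch`, `MiniResponse`, and `runResilienceChecks` are hypothetical names, the URL is a placeholder, and the loop condition is copied from step 1 so the fake upstream exercises the same logic.

```typescript
// Minimal response shape so the sketch does not depend on the global fetch.
type MiniResponse = { status: number; ok: boolean };
type FetchLike = (url: string, method: string) => Promise<MiniResponse>;

const RETRYABLE = new Set([502, 503, 504]);
const SAFE_METHODS = new Set(['GET', 'HEAD', 'PUT', 'DELETE']);

// Hypothetical injectable variant of fetchWithRetry: same loop condition, but
// fetch, the retry hook, and the sleep are parameters so a test can fake them.
function makeRetryingFetch(
  doFetch: FetchLike,
  onRetry: (url: string, attempt: number) => void,
  sleep: (attempt: number) => Promise<void>,
  max = 3,
) {
  return async (url: string, method = 'GET'): Promise<MiniResponse> => {
    for (let attempt = 0; ; attempt++) {
      const res = await doFetch(url, method);
      const retryable = SAFE_METHODS.has(method) && RETRYABLE.has(res.status);
      if (res.ok || !retryable || attempt >= max) return res;
      onRetry(url, attempt);
      await sleep(attempt);
    }
  };
}

// Fake upstream: replays the scripted statuses, then repeats the last one.
function scriptedFetch(statuses: number[]): FetchLike {
  let call = 0;
  return async () => {
    const status = statuses[Math.min(call++, statuses.length - 1)];
    return { status, ok: status >= 200 && status < 300 };
  };
}

async function runResilienceChecks() {
  const noSleep = async () => {}; // skip real backoff so the test stays fast

  // Transient case: three 503s, then a 200.
  let retries = 0;
  const client = makeRetryingFetch(scriptedFetch([503, 503, 503, 200]), () => { retries++; }, noSleep);
  const recovered = (await client('https://upstream.test/items')).status;

  // Permanent case: the upstream never recovers, so the 503 must surface.
  let permanentRetries = 0;
  const broken = makeRetryingFetch(scriptedFetch([503]), () => { permanentRetries++; }, noSleep);
  const permanent = (await broken('https://upstream.test/items')).status;

  return { recovered, retries, permanent, permanentRetries };
}
```

The second case is the important one: if the client ever returned a success for an upstream that stays down, the retry would be masking a permanent failure rather than absorbing a transient one.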

Pitfalls

Retrying non-idempotent requests. A retried POST can duplicate side effects. Mitigation: restrict retries to idempotent methods, or require an idempotency key on writes.
Retrying on 500. A 500 is usually a real defect, not a transient blip. Mitigation: retry only on 502/503/504 and network-level timeouts.
Silent retries. A retry with no metric makes a degraded dependency invisible. Mitigation: emit a counter on every retry and alert on the rate.
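Stacking app retries under test retries. Both layers retrying multiplies attempts and buries the real failure: with three app retries and one runner retry, a single request can be attempted up to eight times before the test fails. Mitigation: keep runner retries at ≤ 1 and test the app retry path explicitly.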
Unbounded backoff. Backoff without a ceiling can stall a test until it times out. Mitigation: cap the delay and the attempt count.
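
The idempotency-key mitigation from the first pitfall can be sketched as a gate in front of the retry loop. This is a minimal illustration under assumptions: the `Idempotency-Key` header name is a common convention but only helps if your server actually deduplicates on it, and `RequestSpec`, `isSafeToRetry`, `withIdempotencyKey`, and the sample key are hypothetical.

```typescript
// Methods that are idempotent by definition (same set as IDEMPOTENT in step 1).
const IDEMPOTENT_METHODS = new Set(['GET', 'HEAD', 'PUT', 'DELETE']);

// Assumed convention: the server deduplicates writes that carry this header.
// Use whatever header name your API actually honors.
const IDEMPOTENCY_HEADER = 'idempotency-key';

type RequestSpec = { method?: string; headers?: Record<string, string> };

// A write is treated as safe to retry only if it carries a non-empty key.
function isSafeToRetry(spec: RequestSpec): boolean {
  const method = (spec.method ?? 'GET').toUpperCase();
  if (IDEMPOTENT_METHODS.has(method)) return true;
  const headers = spec.headers ?? {};
  return Object.keys(headers).some(
    (h) => h.toLowerCase() === IDEMPOTENCY_HEADER && headers[h].length > 0,
  );
}

// Attach one key per logical operation; every retry of that operation reuses it.
function withIdempotencyKey(spec: RequestSpec, key: string): RequestSpec {
  return { ...spec, headers: { ...(spec.headers ?? {}), 'Idempotency-Key': key } };
}

const createOrder: RequestSpec = { method: 'POST' };
const keyedCreateOrder = withIdempotencyKey(createOrder, 'order-7f3a'); // sample key
```

Generate the key once per user action, not once per attempt. A fresh key on each retry defeats the deduplication and brings the duplicate-order risk straight back.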

Reliability targets

| Metric | Target | How to hit it |
| --- | --- | --- |
| Request retry rate (steady state) | `< 1%` of calls | Retry only idempotent + retryable |
| Test runner retries | `≤ 1` in CI, `0` locally | Narrow `retries` config |
| Masked-failure rate | `0%` | Emit retry metric, alert on spikes |
| Backoff ceiling | `≤ 8s` per attempt | Capped exponential + jitter |
| Resilience-path test coverage | `100%` of retrying clients | Assert retry count in tests |
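
The retry-rate target can be checked with simple arithmetic over two counters. The sketch below assumes the rate is defined as recorded retries divided by total calls over the same window, which is one reasonable reading of the target; `retryRate`, `exceedsRetryTarget`, and `RETRY_RATE_TARGET` are hypothetical names.

```typescript
// Assumed target: steady-state retry rate stays below 1% of calls.
const RETRY_RATE_TARGET = 0.01;

// Assumed definition: retries recorded divided by calls made in the same window.
function retryRate(retryCount: number, callCount: number): number {
  return callCount === 0 ? 0 : retryCount / callCount;
}

// True when the window has reached or passed the target and should raise an alert.
function exceedsRetryTarget(retryCount: number, callCount: number): boolean {
  return retryRate(retryCount, callCount) >= RETRY_RATE_TARGET;
}

// 5 retries across 1,000 calls is 0.5%: within target.
const quietWindow = exceedsRetryTarget(5, 1000);
// 40 retries across 1,000 calls is 4%: a degrading upstream worth investigating.
const noisyWindow = exceedsRetryTarget(40, 1000);
```

In practice this comparison usually lives in your monitoring system's alert rule rather than in application code; the point is that the threshold is explicit and checked.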

Frequently Asked Questions

Q: When is a request retry legitimate versus masking flakiness? A: Legitimate when the failure is transient and the request is idempotent — a GET hitting a 503 during a deploy. Masking when it converts a deterministic failure (a 500 from a real bug, or a POST that should not repeat) into a pass. The metric on every retry is what lets you tell them apart over time.

Q: Should I just turn on test runner retries instead of app retries? A: No — they solve different problems. Runner retries re-run the entire test and hide everything, so keep them at one as a thin safety net. App retries are a product feature you should test explicitly. Treat a rising runner-retry rate as a flakiness signal to investigate, not a setting to crank up.

Q: How do I prove a retry isn’t hiding a real bug? A: Make the retry observable and assert on it. In the resilience test, mock a finite number of transient failures, assert recovery, and assert the exact retry count. In production, alert when the retry-rate metric climbs so a degrading dependency surfaces instead of staying green.