Sending Slack Alerts for Flakiness

Only a sustained breach that is not already open results in a Slack post, suppressing duplicate noise.

Root cause

An alert that fires on every flaky run trains people to ignore it. Flakiness is, by definition, intermittent: a suite whose runs fail from flakiness 3% of the time will produce a failing run roughly one time in thirty-three even when nothing has regressed. If the alert condition is “this run had a flaky test,” the channel fills with single-run noise and the signal of a genuine SLO regression is lost. The fix has two parts: evaluate the breach against a rolling aggregate rather than a single run, and dedupe so the same open breach does not re-post on every subsequent pipeline.
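To make the noise concrete, the arithmetic can be sketched as below. This is a minimal illustration, assuming the 3% figure is the probability that any single run fails from flakiness alone and that runs are independent; the run volume (20 runs a day) and the function name `perRunAlertNoise` are hypothetical.

```javascript
// Sketch: how noisy a per-run alert is when nothing has regressed.
// Assumption: each run independently fails from flakiness with probability p.
function perRunAlertNoise(p, runs) {
  return {
    // Expected number of runs that fail (and would alert) purely by chance.
    expectedAlerts: p * runs,
    // Probability that at least one such alert fires across all runs.
    probAtLeastOne: 1 - Math.pow(1 - p, runs),
  };
}

// Hypothetical volume: 20 pipeline runs a day over a 7-day window.
const noise = perRunAlertNoise(0.03, 20 * 7);
console.log(noise.expectedAlerts.toFixed(1), noise.probAtLeastOne.toFixed(3));
```

Under those assumptions a per-run alert would fire about four times a week with no regression at all, which is the noise a rolling aggregate is meant to absorb.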

Slack incoming webhooks are the right transport because they are a single authenticated URL with no OAuth scopes to manage and no per-message rate ceiling that CI will realistically hit. The webhook URL is a secret — anyone holding it can post to the channel — so it lives in repository secrets as SLACK_WEBHOOK_URL, never inlined. The hard part is not sending the message; it is deciding whether to send it at all.
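Because the URL is the credential, any script that touches it should avoid printing it. The following is a minimal sketch, assuming the secret reaches the script as the `SLACK_WEBHOOK_URL` environment variable; the helper names `getWebhookUrl` and `redactWebhookUrl` are hypothetical, and the host check assumes a standard Slack workspace whose webhooks are served from `hooks.slack.com`.

```javascript
// Sketch: read the webhook from the environment without ever logging it.
// Assumption: the secret is exposed to the script as SLACK_WEBHOOK_URL.
function getWebhookUrl(env) {
  const url = env.SLACK_WEBHOOK_URL;
  if (!url) {
    throw new Error('SLACK_WEBHOOK_URL is not set');
  }
  // Reject anything that is not a hooks.slack.com URL early, and keep the
  // value itself out of the error message.
  if (!url.startsWith('https://hooks.slack.com/')) {
    throw new Error('SLACK_WEBHOOK_URL does not look like a Slack webhook');
  }
  return url;
}

// Safe-to-log form: keeps the host, hides the path that acts as the credential.
function redactWebhookUrl(url) {
  return new URL(url).origin + '/<redacted>';
}
```

A caller would pass `process.env` to `getWebhookUrl` and log only the redacted form if it needs to log anything at all.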

Step-by-step fix

1. Compute the breach against a rolling window

Compare the rolling flake rate to the SLO, not the latest run. A single run is too noisy to gate an alert.

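// scripts/evaluate-slo.js (ESM: needs "type": "module" in package.json, or a .mjs extension)
// Trade-off: a longer window is more stable but slower to react to a real
// regression; 7 days balances responsiveness against single-run noise.
import { readFileSync, appendFileSync } from 'node:fs';

const SLO = 2.0; // percent of executions allowed to be flaky
// One entry per run from the last 7 days, each with { flaky, total } execution counts.
const history = JSON.parse(readFileSync('flake-history.json', 'utf8'));
const flaky = history.reduce((n, r) => n + r.flaky, 0);
const total = history.reduce((n, r) => n + r.total, 0);
// Guard against an empty window: 0 / 0 would yield NaN.
const rate = total > 0 ? (flaky / total) * 100 : 0;

const breached = rate > SLO;
console.log(JSON.stringify({ rate: rate.toFixed(2), slo: SLO, breached }));

// Publish the result as step outputs so later steps can read
// steps.<id>.outputs.rate and steps.<id>.outputs.breached. The script exits 0
// even on breach: a failed step would skip the dedupe and alert steps after it.
if (process.env.GITHUB_OUTPUT) {
  appendFileSync(process.env.GITHUB_OUTPUT, `rate=${rate.toFixed(2)}\nbreached=${breached}\n`);
}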
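The script above trusts that the history file already holds only the last seven days. One way that trimming could be done is sketched here, assuming each run record carries a `finishedAt` ISO timestamp alongside `flaky` and `total`; that record shape and the names `windowRuns` and `flakeRate` are hypothetical.

```javascript
// Sketch: keep only runs that finished inside the rolling window.
// Assumption: each run record looks like { finishedAt, flaky, total }.
const DAY_MS = 24 * 60 * 60 * 1000;

function windowRuns(runs, now, days = 7) {
  const cutoff = now.getTime() - days * DAY_MS;
  return runs.filter((r) => {
    const t = Date.parse(r.finishedAt);
    // Drop records with unparseable timestamps rather than guessing.
    return Number.isFinite(t) && t >= cutoff && t <= now.getTime();
  });
}

// Aggregate flake rate (percent) over whatever runs are passed in.
function flakeRate(runs) {
  const flaky = runs.reduce((n, r) => n + r.flaky, 0);
  const total = runs.reduce((n, r) => n + r.total, 0);
  return total > 0 ? (flaky / total) * 100 : 0;
}
```

Summing counts across the window, rather than averaging per-run percentages, keeps a tiny run from weighing as much as a large one.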

2. Dedupe so an open breach does not re-fire

Persist a marker (a cache key or a tiny state file in an artifact) so a breach that is already open is not re-announced every pipeline. Only post on the transition into breach.

// scripts/should-alert.js: reports 'alert' only on the no→yes transition
// Trade-off: state in CI cache can be lost on eviction, causing a re-alert;
// acceptable because a re-alert on an open breach is rarely harmful.
import { existsSync, writeFileSync, rmSync, appendFileSync } from 'node:fs';

const breached = process.argv[2] === 'true';
const MARKER = '.flaky-breach-open';
const wasOpen = existsSync(MARKER);

let decision = 'skip';
if (breached && !wasOpen) { writeFileSync(MARKER, '1'); decision = 'alert'; }
else if (!breached && wasOpen) { rmSync(MARKER); decision = 'recovered'; }

console.log(decision);
// Publish the decision as a step output so later steps can test
// steps.<id>.outputs.decision.
if (process.env.GITHUB_OUTPUT) {
  appendFileSync(process.env.GITHUB_OUTPUT, `decision=${decision}\n`);
}
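The transition logic is easier to reason about, and to unit test, when separated from the file system. The sketch below restates it as a pure function and replays a made-up series of pipeline evaluations; the names `decide` and `replay` and the sample sequence are hypothetical.

```javascript
// Sketch: the dedupe decision as a pure function of (breached, wasOpen).
function decide(breached, wasOpen) {
  if (breached && !wasOpen) return { decision: 'alert', open: true };
  if (!breached && wasOpen) return { decision: 'recovered', open: false };
  return { decision: 'skip', open: wasOpen };
}

// Replay a series of pipeline evaluations, carrying the open/closed state forward
// the way the marker file does between pipelines.
function replay(evaluations) {
  let open = false;
  return evaluations.map((breached) => {
    const step = decide(breached, open);
    open = step.open;
    return step.decision;
  });
}

console.log(replay([false, true, true, true, false, false, true]).join(' '));
// prints "skip alert skip skip recovered skip alert"
```

Three consecutive breached pipelines produce a single alert, and the next alert only appears after a recovery has closed the first breach.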

3. Post the Slack webhook from GitHub Actions

Send a Block Kit payload only when step 2 decided to alert, reading the webhook from the SLACK_WEBHOOK_URL secret.

# .github/workflows/flaky-slo.yml
- name: Alert Slack on SLO breach
  if: steps.dedupe.outputs.decision == 'alert'
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
    RATE: ${{ steps.evaluate.outputs.rate }}
    RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
  run: |
    # Trade-off: a rich Block Kit payload is more actionable but harder to
    # template; keep it to one section + one button to stay maintainable.
    curl -sS --fail -X POST "$SLACK_WEBHOOK_URL" \
      -H 'Content-Type: application/json' \
      -d "{\"blocks\":[
        {\"type\":\"section\",\"text\":{\"type\":\"mrkdwn\",\"text\":\":rotating_light: *Flakiness SLO breached* — rate *${RATE}%* exceeds 2% on \`${GITHUB_REF_NAME}\`\"}},
        {\"type\":\"actions\",\"elements\":[{\"type\":\"button\",\"text\":{\"type\":\"plain_text\",\"text\":\"View run\"},\"url\":\"${RUN_URL}\"}]}
      ]}"
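Hand-escaping JSON inside a shell string is fragile: a branch name containing a quote would break the payload. A possible alternative is to build the message as an object and let `JSON.stringify` handle the escaping. This is a minimal sketch that mirrors the same one section plus one button layout; the function name `buildBreachPayload` and its parameter names are hypothetical.

```javascript
// Sketch: build the breach message as an object and let JSON.stringify escape it.
// All parameter names are hypothetical; the layout mirrors the curl example.
function buildBreachPayload({ rate, slo, ref, runUrl }) {
  return {
    blocks: [
      {
        type: 'section',
        text: {
          type: 'mrkdwn',
          text: `:rotating_light: *Flakiness SLO breached*: rate *${rate}%* exceeds ${slo}% on \`${ref}\``,
        },
      },
      {
        type: 'actions',
        elements: [
          { type: 'button', text: { type: 'plain_text', text: 'View run' }, url: runUrl },
        ],
      },
    ],
  };
}

// Example values only; in CI these would come from the step outputs and run context.
const body = JSON.stringify(
  buildBreachPayload({ rate: '3.10', slo: 2, ref: 'main', runUrl: 'https://example.com/run/1' })
);
console.log(body.length > 0);
```

The resulting string could be sent with the `fetch` built into Node 18 and later, using method `POST` and a `Content-Type: application/json` header. Passing the SLO in as a parameter also avoids repeating the 2% figure in the message text.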

4. Send a recovery notice

Closing the loop matters as much as opening it. Post a recovery message on the yes→no transition so the channel knows the breach cleared, and route persistent offenders into an auto-quarantine workflow rather than alerting forever.

- name: Post recovery
  if: steps.dedupe.outputs.decision == 'recovered'
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
  run: |
    # Trade-off: recovery notices add volume but prevent stale "is it fixed?"
    # pings; keep them terse.
    curl -sS --fail -X POST "$SLACK_WEBHOOK_URL" -H 'Content-Type: application/json' \
      -d '{"text":":white_check_mark: Flakiness SLO recovered — rate back under 2%."}'
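A recovery notice can stay terse and still answer the obvious follow-up question of how long the breach lasted. The sketch below assumes a variation on step 2 in which the marker file stores the ISO timestamp of when the breach opened instead of `1`; that variation and the names `formatDuration` and `buildRecoveryPayload` are hypothetical.

```javascript
// Sketch: a terse recovery text that also says how long the breach was open.
// Assumption: the marker file holds the ISO timestamp written when the breach opened.
function formatDuration(ms) {
  const minutes = Math.max(0, Math.round(ms / 60000));
  const days = Math.floor(minutes / 1440);
  const hours = Math.floor((minutes % 1440) / 60);
  const mins = minutes % 60;
  if (days > 0) return `${days}d ${hours}h`;
  if (hours > 0) return `${hours}h ${mins}m`;
  return `${mins}m`;
}

function buildRecoveryPayload(markerContent, now, slo) {
  // Only trust content that looks like an ISO timestamp; a legacy '1' marker is ignored.
  const looksIso = /^\d{4}-\d{2}-\d{2}T/.test(markerContent);
  const openedAt = looksIso ? Date.parse(markerContent) : NaN;
  // Fall back to the plain message when no usable timestamp is available.
  const suffix = Number.isFinite(openedAt)
    ? ` Breach was open for ${formatDuration(now.getTime() - openedAt)}.`
    : '';
  return { text: `:white_check_mark: Flakiness SLO recovered, rate back under ${slo}%.${suffix}` };
}
```

The marker would have to be read before it is removed, so this only works if the dedupe step passes its content along.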

Pitfalls

Alerting per run. Single-run failures are expected at any non-zero flake rate. Mitigation: evaluate against a rolling window, not the latest run.
No deduping. Re-posting an open breach every pipeline gets the channel muted. Mitigation: alert only on the no→yes transition.
No recovery message. People keep asking “is it fixed?”. Mitigation: post on the yes→no transition too.
Webhook URL leaking into logs. Echoing the env var exposes a postable secret. Mitigation: never echo it; keep it in repository secrets only.
Threshold set at the current baseline. An SLO equal to today’s rate fires constantly. Mitigation: set the SLO above the current baseline, then ratchet down.
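The last pitfall suggests starting above the baseline and ratcheting down. One possible ratchet policy is sketched here, purely as an illustration: the `headroom` and `step` knobs, their default values, and the function name `nextSlo` are all hypothetical, and every value is a percentage.

```javascript
// Sketch: ratchet the SLO down toward a target as the observed rate improves.
// Assumption: headroom, step, and targetSlo are hypothetical tuning knobs (percent).
function nextSlo({ currentSlo, observedRate, targetSlo, headroom = 0.5, step = 0.5 }) {
  // Never tighten past the point where observed rate plus headroom would breach,
  // and never go below the long-term target.
  const floor = Math.max(targetSlo, observedRate + headroom);
  // Tighten by at most one step per review, and never loosen automatically.
  return Math.min(currentSlo, Math.max(floor, currentSlo - step));
}

// Example review: SLO at 5%, observed 7-day rate at 3%, long-term target 2%.
console.log(nextSlo({ currentSlo: 5, observedRate: 3, targetSlo: 2 }));
```

With these example inputs the SLO tightens one step at a time and stops at 3.5%, half a point above the observed rate, until the suite itself improves.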

Reliability targets

Metric	Target	Notes
Flakiness SLO	≤ 2% over 7d	Rolling-window breach condition
Alert dedupe	1 post per breach	Only on no→yes transition
Time to alert	< 5 min after run	Webhook POST is near-instant
False-alert rate	< 1 / week	Achieved via rolling window
Recovery notice	100% of closed breaches	On yes→no transition

Frequently Asked Questions

Q: How do I stop the alert from firing on every flaky run? A: Evaluate the breach against a rolling 7-day flake rate rather than a single run, and post only on the transition into breach. A test suite with any non-zero flake rate will occasionally fail a run without that being an SLO regression.

Q: Where should the Slack webhook URL live? A: In repository secrets, as SLACK_WEBHOOK_URL. The URL is itself the credential — anyone with it can post to the channel — so it is never inlined or echoed to logs.

Q: What should happen to a suite that breaches repeatedly? A: Stop alerting and start quarantining. A chronically flaky suite is a backlog item, not a recurring page; route it into an auto-quarantine workflow so the channel stays meaningful.