Root cause #
An alert that fires on every flaky run trains people to ignore it. Flakiness is intermittent by definition: a suite whose runs fail spuriously 3% of the time will produce a failing run about one time in 33 even when nothing has regressed. If the alert condition is “this run had a flaky test,” the channel fills with single-run noise and a genuine SLO regression gets lost in it. The fix has two parts: evaluate the breach against a rolling aggregate rather than a single run, and dedupe so the same open breach does not re-post on every subsequent pipeline.
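To put rough numbers on that noise, here is a minimal sketch. The 20 runs per day is an assumed pipeline volume, and both function names are hypothetical; only the 3% rate comes from the text above.

```javascript
// Expected number of "a run flaked" alerts over a period when nothing has regressed.
// perRunFlakeRate is the chance that a single pipeline run fails for flaky reasons.
function expectedNoisyAlerts(perRunFlakeRate, runsPerDay, days) {
  return perRunFlakeRate * runsPerDay * days;
}

// Chance that at least one run out of `runs` independent runs flakes.
function chanceOfAtLeastOneFlake(perRunFlakeRate, runs) {
  return 1 - Math.pow(1 - perRunFlakeRate, runs);
}

const perWeek = expectedNoisyAlerts(0.03, 20, 7); // assumed: 20 runs per day
const anyToday = chanceOfAtLeastOneFlake(0.03, 20);
console.log(`${perWeek.toFixed(1)} noisy alerts per week`);
console.log(`${(anyToday * 100).toFixed(0)}% chance of at least one today`);
```

Under those assumptions a per-run alert would post about four times a week, with close to even odds of at least one post on any given day, all without a real regression.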
Slack incoming webhooks are the right transport because each one is a single URL that carries its own credential, with no OAuth scopes to manage and no rate limit that a deduped CI alert will realistically hit. Because the URL is the secret (anyone holding it can post to the channel), it lives in repository secrets as SLACK_WEBHOOK_URL, never inlined. The hard part is not sending the message; it is deciding whether to send it at all.
Step-by-step fix #
1. Compute the breach against a rolling window #
Compare the rolling flake rate to the SLO, not the latest run. A single run is too noisy to gate an alert.
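// scripts/evaluate-slo.js
// Trade-off: a longer window is more stable but slower to react to a real
// regression; 7 days balances responsiveness against single-run noise.
import { readFileSync, appendFileSync } from 'node:fs';
const SLO = 2.0; // percent of executions allowed to be flaky
// Assumes flake-history.json already holds only the last 7 days of runs.
const history = JSON.parse(readFileSync('flake-history.json', 'utf8'));
const flaky = history.reduce((n, r) => n + r.flaky, 0);
const total = history.reduce((n, r) => n + r.total, 0);
// Guard the empty window: 0 / 0 is NaN, which would silently never breach.
const rate = total === 0 ? 0 : (flaky / total) * 100;
const breached = rate > SLO;
console.log(JSON.stringify({ rate: rate.toFixed(2), slo: SLO, breached }));
// Publish step outputs so later steps can read steps.evaluate.outputs.rate.
// GITHUB_OUTPUT is set by GitHub Actions; it is absent in local runs.
if (process.env.GITHUB_OUTPUT) {
  appendFileSync(process.env.GITHUB_OUTPUT, `rate=${rate.toFixed(2)}\nbreached=${breached}\n`);
}
// A non-zero exit fails the step; the step needs continue-on-error: true
// so that the dedupe and alert steps still run after a breach.
process.exit(breached ? 1 : 0);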
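The script above relies on flake-history.json already being trimmed to seven days. If the file holds a longer history, the window can be applied at read time. The sketch below assumes each record carries a `finishedAt` ISO timestamp; that field name and the `rollingFlakeRate` helper are hypothetical.

```javascript
// Hypothetical record shape: { finishedAt: ISO-8601 string, flaky: number, total: number }
const DAY_MS = 24 * 60 * 60 * 1000;

// Flake rate in percent over the runs that finished within the last `windowDays`.
function rollingFlakeRate(history, now, windowDays = 7) {
  const cutoff = now.getTime() - windowDays * DAY_MS;
  const recent = history.filter((run) => Date.parse(run.finishedAt) >= cutoff);
  const flaky = recent.reduce((n, run) => n + run.flaky, 0);
  const total = recent.reduce((n, run) => n + run.total, 0);
  // An empty window counts as 0% rather than NaN.
  return total === 0 ? 0 : (flaky / total) * 100;
}

const sample = [
  { finishedAt: '2024-05-01T12:00:00Z', flaky: 50, total: 100 }, // outside the window
  { finishedAt: '2024-05-10T12:00:00Z', flaky: 1, total: 100 },
  { finishedAt: '2024-05-11T12:00:00Z', flaky: 2, total: 100 },
];
const rate = rollingFlakeRate(sample, new Date('2024-05-12T00:00:00Z'));
console.log(rate.toFixed(2));
```

Summing flaky and total executions before dividing weights each run by its size, so one small noisy run cannot dominate the window.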
2. Dedupe so an open breach does not re-fire #
Persist a marker (a cache key or a tiny state file in an artifact) so a breach that is already open is not re-announced every pipeline. Only post on the transition into breach.
// scripts/should-alert.js: decides 'alert' only on the no→yes transition
// Trade-off: state in CI cache can be lost on eviction, causing a re-alert;
// acceptable because a re-alert on an open breach is rarely harmful.
import { existsSync, writeFileSync, rmSync, appendFileSync } from 'node:fs';
const breached = process.argv[2] === 'true';
const MARKER = '.flaky-breach-open';
const wasOpen = existsSync(MARKER);
let decision = 'skip';
if (breached && !wasOpen) { writeFileSync(MARKER, '1'); decision = 'alert'; }
else if (!breached && wasOpen) { rmSync(MARKER); decision = 'recovered'; }
console.log(decision);
// Publish the decision so the workflow can test steps.dedupe.outputs.decision.
if (process.env.GITHUB_OUTPUT) appendFileSync(process.env.GITHUB_OUTPUT, `decision=${decision}\n`);
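To see the dedupe behavior across several pipelines, the same transition rule can be written as a pure function and replayed over a sequence of breach results. This is an illustrative sketch; `decide` and `replay` are hypothetical names, and the marker file is replaced by an in-memory flag.

```javascript
// Same transition rule as the script, with the marker file modeled as a boolean.
function decide(breached, wasOpen) {
  if (breached && !wasOpen) return { decision: 'alert', open: true };
  if (!breached && wasOpen) return { decision: 'recovered', open: false };
  return { decision: 'skip', open: wasOpen };
}

// Feed one breach result per pipeline and collect the decision for each.
function replay(breachedPerPipeline) {
  let open = false;
  return breachedPerPipeline.map((breached) => {
    const result = decide(breached, open);
    open = result.open;
    return result.decision;
  });
}

console.log(replay([false, true, true, true, false, false]).join(' '));
// prints: skip alert skip skip recovered skip
```

Three consecutive breached pipelines produce a single alert, and the later recovery is announced once.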
3. Post the Slack webhook from GitHub Actions #
Send a Block Kit payload only when step 2 decided to alert, reading the webhook from the SLACK_WEBHOOK_URL secret.
# .github/workflows/flaky-slo.yml
- name: Alert Slack on SLO breach
  if: steps.dedupe.outputs.decision == 'alert'
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
    RATE: ${{ steps.evaluate.outputs.rate }}
    RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
  run: |
    # Trade-off: a rich Block Kit payload is more actionable but harder to
    # template; keep it to one section + one button to stay maintainable.
    # -f makes curl exit non-zero on an HTTP error so a rejected post is visible.
    curl -fsS -X POST "$SLACK_WEBHOOK_URL" \
      -H 'Content-Type: application/json' \
      -d "{\"blocks\":[
        {\"type\":\"section\",\"text\":{\"type\":\"mrkdwn\",\"text\":\":rotating_light: *Flakiness SLO breached*: rate *${RATE}%* exceeds 2% on \`${GITHUB_REF_NAME}\`\"}},
        {\"type\":\"actions\",\"elements\":[{\"type\":\"button\",\"text\":{\"type\":\"plain_text\",\"text\":\"View run\"},\"url\":\"${RUN_URL}\"}]}
      ]}"
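Hand-escaped JSON inside a shell string is easy to break. One alternative, sketched here rather than taken from the workflow above, is to build the same payload in Node and let JSON.stringify handle the quoting. `buildBreachPayload` and `postToSlack` are hypothetical helpers; `postToSlack` assumes Node 18 or later for the global `fetch`.

```javascript
// Builds the one-section, one-button Block Kit payload used in the curl example.
function buildBreachPayload({ rate, slo, ref, runUrl }) {
  return {
    blocks: [
      {
        type: 'section',
        text: {
          type: 'mrkdwn',
          text: `:rotating_light: *Flakiness SLO breached*: rate *${rate}%* exceeds ${slo}% on \`${ref}\``,
        },
      },
      {
        type: 'actions',
        elements: [
          { type: 'button', text: { type: 'plain_text', text: 'View run' }, url: runUrl },
        ],
      },
    ],
  };
}

// Posts a payload to an incoming webhook; throws on a non-2xx response.
async function postToSlack(webhookUrl, payload) {
  const res = await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`Slack webhook returned HTTP ${res.status}`);
}

// Example values only; in CI these would come from the step's env block.
const payload = buildBreachPayload({
  rate: '2.41',
  slo: 2,
  ref: 'main',
  runUrl: 'https://example.com/actions/runs/1',
});
console.log(JSON.stringify(payload, null, 2));
```

In a workflow step this would be called as `await postToSlack(process.env.SLACK_WEBHOOK_URL, payload)`, which keeps the webhook URL out of the script source and out of the logs.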
4. Send a recovery notice #
Closing the loop matters as much as opening it. Post a recovery message on the yes→no transition so the channel knows the breach cleared, and route persistent offenders into an auto-quarantine workflow rather than alerting forever.
- name: Post recovery
  if: steps.dedupe.outputs.decision == 'recovered'
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
  run: |
    # Trade-off: recovery notices add volume but prevent stale "is it fixed?"
    # pings; keep them terse.
    curl -fsS -X POST "$SLACK_WEBHOOK_URL" -H 'Content-Type: application/json' \
      -d '{"text":":white_check_mark: Flakiness SLO recovered: rate back under 2%."}'
Pitfalls #
- Alerting per run. Single-run failures are expected at any non-zero flake rate. Mitigation: evaluate against a rolling window, not the latest run.
- No deduping. Re-posting an open breach every pipeline gets the channel muted. Mitigation: alert only on the no→yes transition.
- No recovery message. People keep asking “is it fixed?”. Mitigation: post on the yes→no transition too.
- Webhook URL leaking into logs. Echoing the env var exposes a postable secret. Mitigation: never echo it; keep it in repository secrets only.
- Threshold set at the current baseline. An SLO equal to today’s rate flips in and out of breach on ordinary variation, so alert and recovery notices fire constantly. Mitigation: set the SLO above the current baseline, then ratchet down.
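The ratchet in the last pitfall can be made mechanical. The sketch below is one possible rule, not a prescribed one: `ratchetSlo`, its `margin` parameter, and the example numbers are all assumptions for illustration. It assumes the current SLO starts at or above the long-term target.

```javascript
// Hypothetical helper: propose the next SLO (in percent) from the observed rolling rate.
function ratchetSlo({ slo, observedRate, target, margin }) {
  // The candidate threshold sits `margin` points above what the suite achieves today.
  const candidate = observedRate + margin;
  // Never loosen the current SLO, and never tighten past the long-term target.
  return Math.max(target, Math.min(slo, candidate));
}

// Suite improved to 3.2%: tighten from 5% toward observed rate plus margin.
console.log(ratchetSlo({ slo: 5, observedRate: 3.2, target: 2, margin: 0.5 }));
// Suite got worse: the SLO stays where it is instead of loosening.
console.log(ratchetSlo({ slo: 5, observedRate: 6.0, target: 2, margin: 0.5 }));
// Suite is well under target: stop at the 2% target.
console.log(ratchetSlo({ slo: 5, observedRate: 0.4, target: 2, margin: 0.5 }));
```

Keeping a margin above the observed rate is what stops the threshold from sitting on the baseline and flapping.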
Reliability targets #
| Metric | Target | Notes |
|---|---|---|
| Flakiness SLO | ≤ 2% over 7d | Rolling-window breach condition |
| Alert dedupe | 1 post per breach | Only on no→yes transition |
| Time to alert | < 5 min after run | Webhook POST is near-instant |
| False-alert rate | < 1 / week | Achieved via rolling window |
| Recovery notice | 100% of closed breaches | On yes→no transition |
Frequently Asked Questions #
Q: How do I stop the alert from firing on every flaky run?
A: Evaluate the breach against a rolling 7-day flake rate rather than a single run, and post only on the transition into breach. A test suite at any non-zero flake rate will occasionally fail a run without that being an SLO regression.
Q: Where should the Slack webhook URL live?
A: In repository secrets, as SLACK_WEBHOOK_URL. The URL is itself the credential — anyone with it can post to the channel — so it is never inlined or echoed to logs.
Q: What should happen to a suite that breaches repeatedly?
A: Stop alerting and start quarantining. A chronically flaky suite is a backlog item, not a recurring page; route it into an auto-quarantine workflow so the channel stays meaningful.