Root cause
Test reliability data is most actionable in the seconds after a run finishes, while the author still has context on what they changed. A reviewer who has to leave the pull request, open a separate dashboard, filter to the right branch, and correlate timestamps will simply not do it. GitHub Actions solves the location problem with the step summary: any text appended to the file path in $GITHUB_STEP_SUMMARY is rendered as markdown on the run’s summary tab and linked from the checks UI. The mechanism is a plain file append — there is no API, no auth, no rate limit — so it is the cheapest possible place to publish a per-run reliability snapshot.
The catch is that the summary is per-run: it lives with that workflow run and is not aggregated across history. That is the correct division of labor. Use the step summary for the immediate “is this run flaky?” question, and ship the same parsed numbers to a durable store (for example, by streaming flakiness metrics to Datadog) to track the cross-run trend.
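To make the "same parsed numbers" idea concrete, here is a minimal sketch that turns per-suite counts into flat records, one JSON object per line, which a log shipper or metrics agent could forward to a durable store. The function name `toMetricRecords` and the record field names are hypothetical; `GITHUB_RUN_ID`, `GITHUB_REF_NAME`, and `GITHUB_SHA` are default environment variables on GitHub-hosted runs.

```javascript
// Build one flat record per suite so the numbers that feed the step summary
// can also be shipped elsewhere. Field names are illustrative, not a schema.
function toMetricRecords(suites, env = process.env) {
  return suites.map((s) => ({
    suite: s.title,
    total: s.total,
    flaky: s.flaky,
    rate: s.total ? s.flaky / s.total : 0, // guard against empty suites
    runId: env.GITHUB_RUN_ID ?? 'local',
    branch: env.GITHUB_REF_NAME ?? 'local',
    sha: env.GITHUB_SHA ?? 'unknown',
  }));
}

// Usage: emit newline-delimited JSON, one record per suite.
const records = toMetricRecords(
  [{ title: 'checkout', total: 40, flaky: 2 }],
  { GITHUB_RUN_ID: '123', GITHUB_REF_NAME: 'main', GITHUB_SHA: 'abc' },
);
console.log(records.map((r) => JSON.stringify(r)).join('\n'));
```

Keeping the record flat means the store, not the workflow, decides how to aggregate across runs.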
Step-by-step fix
1. Append a markdown table to the step summary
After the test job, parse the reporter JSON and append a table. Anything echoed to the $GITHUB_STEP_SUMMARY file path is rendered.
    # .github/workflows/test.yml
    - name: Write flakiness summary
      if: always()
      run: |
        echo "## Flakiness summary" >> "$GITHUB_STEP_SUMMARY"
        echo "" >> "$GITHUB_STEP_SUMMARY"
        echo "| Suite | Total | Flaky | Rate |" >> "$GITHUB_STEP_SUMMARY"
        echo "| --- | --- | --- | --- |" >> "$GITHUB_STEP_SUMMARY"
        # Trade-off: parsing in shell avoids a Node startup cost but is brittle;
        # for nested reporters prefer a small script (step 2) over jq one-liners.
        # A spec counts as flaky when any of its tests has more than one result
        # (it was retried); any() keeps multi-test specs from being counted twice.
        jq -r '.suites[]
          | (.specs | length) as $total
          | ([.specs[] | select(any(.tests[]; (.results | length) > 1))] | length) as $flaky
          | "| \(.title) | \($total) | \($flaky) | \(if $total > 0 then ($flaky / $total * 100 | floor) else 0 end)% |"' \
          results.json >> "$GITHUB_STEP_SUMMARY"
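The jq filter above assumes a report shaped as suites, then specs, then tests, then results, with one results entry per attempt. A small fixture in that shape makes it easy to try the step locally before wiring it into CI. The sketch below prints one; the suite and spec titles and the `status` values are hypothetical, and your reporter's real output will carry more fields.

```javascript
// Hypothetical fixture: suites -> specs -> tests -> results (one entry per attempt).
const attempt = (status) => ({ status });

const fixture = {
  suites: [
    {
      title: 'checkout',
      specs: [
        // Passed first time: a single result, so not flaky.
        { title: 'pays with card', tests: [{ results: [attempt('passed')] }] },
        // Failed once, passed on retry: two results, so it counts as flaky.
        { title: 'applies coupon', tests: [{ results: [attempt('failed'), attempt('passed')] }] },
      ],
    },
  ],
};

// Redirect to a file to try the step locally: node fixture.js > results.json
console.log(JSON.stringify(fixture, null, 2));
```

Run against this fixture, the filter should produce the single row `| checkout | 2 | 1 | 50% |`.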
2. Build the summary from a script for richer tables
For nested reporters or computed columns, generate the markdown in Node and append the rendered string. This keeps the parsing testable.
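    // scripts/summary.js: run with `node scripts/summary.js`. The script appends
    // to $GITHUB_STEP_SUMMARY itself, so do not also redirect stdout.
    // ESM syntax: needs "type": "module" in package.json, or rename to summary.mjs.
    // Trade-off: a Node script adds ~300ms startup but is far less brittle than jq
    // for deeply nested Playwright/Jest result trees.
    import { readFileSync, appendFileSync, writeFileSync } from 'node:fs';

    const data = JSON.parse(readFileSync('results.json', 'utf8'));
    let worst = 0;
    const rows = data.suites.map((s) => {
      // A spec counts as flaky if any of its tests needed more than one attempt.
      const flaky = s.specs.filter((spec) => spec.tests.some((t) => t.results.length > 1)).length;
      // Keep the rate numeric for comparisons; guard against empty suites.
      const rate = s.specs.length ? (flaky / s.specs.length) * 100 : 0;
      worst = Math.max(worst, rate);
      const icon = rate > 5 ? '🔴' : rate > 0 ? '🟡' : '🟢';
      return `| ${s.title} | ${s.specs.length} | ${flaky} | ${icon} ${rate.toFixed(1)}% |`;
    });
    const md = ['## Flakiness summary', '', '| Suite | Total | Flaky | Rate |', '| --- | --- | --- | --- |', ...rows, ''].join('\n');
    appendFileSync(process.env.GITHUB_STEP_SUMMARY, md);
    // Worst suite rate as an integer percentage, for the shell steps that read rate.txt.
    writeFileSync('rate.txt', String(Math.ceil(worst)));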
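Some reporters nest suites inside suites, which a single-level `map` over `data.suites` would miss. The sketch below shows one way to handle that while keeping the logic testable: a recursive collector plus a pure stats function with no file or environment access. It assumes child suites live under a `suites` key on each suite, which is worth verifying against your reporter's actual output; `collectSpecs` and `suiteStats` are hypothetical helper names.

```javascript
// Recursively gather every spec under a suite, including nested child suites.
function collectSpecs(suite) {
  const own = suite.specs ?? [];
  const nested = (suite.suites ?? []).flatMap(collectSpecs);
  return [...own, ...nested];
}

// Pure function: a suite object in, plain numbers out, so it can be unit-tested.
function suiteStats(suite) {
  const specs = collectSpecs(suite);
  const flaky = specs.filter((spec) => spec.tests.some((t) => t.results.length > 1)).length;
  const rate = specs.length ? (flaky / specs.length) * 100 : 0;
  return { title: suite.title, total: specs.length, flaky, rate };
}

// Usage: one top-level spec that passed first time, one nested spec that was retried.
const stats = suiteStats({
  title: 'e2e',
  specs: [{ tests: [{ results: [{}] }] }],
  suites: [{ title: 'cart', specs: [{ tests: [{ results: [{}, {}] }] }] }],
});
console.log(stats); // { title: 'e2e', total: 2, flaky: 1, rate: 50 }
```

Separating the counting from the file writes means the rendering script only has to format the numbers this function returns.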
3. Add run context and a badge link
Use the github context to stamp the summary with the branch, actor, and run, and link a Shields-style badge that reflects the worst suite.
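    - name: Stamp run context
      if: always()
      env:
        # Pass user-controlled values via env instead of inline ${{ }} to avoid shell injection.
        REF_NAME: ${{ github.ref_name }}
        ACTOR: ${{ github.actor }}
      run: |
        # Trade-off: embedding context inline avoids an extra artifact upload but
        # the values are frozen at run time; re-runs overwrite them.
        echo "_Branch \`${REF_NAME}\` · run [#${{ github.run_number }}](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}) · by @${ACTOR}_" >> "$GITHUB_STEP_SUMMARY"
        # rate.txt is expected to hold the worst suite's rate as an integer percentage.
        RATE=$(cat rate.txt)
        COLOR=$([ "$RATE" -gt 5 ] && echo red || echo brightgreen)
        # Static Shields badge path is <label>-<message>-<color>; %25 is a URL-encoded "%".
        # The "flakiness" label and alt text are illustrative.
        echo "![flakiness](https://img.shields.io/badge/flakiness-${RATE}%25-${COLOR})" >> "$GITHUB_STEP_SUMMARY"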
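If the badge is built in the Node script rather than in shell, the URL needs two kinds of escaping: Shields static badges use `-` as the separator between label, message, and color, so a literal dash is written `--` and a literal underscore `__`, and the percent sign must be URL-encoded. A minimal sketch, assuming the static badge endpoint at `img.shields.io/badge/`; the helper names and the `flakiness` label are hypothetical.

```javascript
// Escape a label or message for a Shields static badge path segment:
// "-" becomes "--" and "_" becomes "__", then percent-encode the rest.
function badgeSegment(text) {
  return encodeURIComponent(String(text).replace(/-/g, '--').replace(/_/g, '__'));
}

// Build the markdown image for a rate, red above the threshold, green otherwise.
function badgeMarkdown(label, ratePercent, threshold = 5) {
  const color = ratePercent > threshold ? 'red' : 'brightgreen';
  const message = badgeSegment(`${ratePercent}%`);
  const url = `https://img.shields.io/badge/${badgeSegment(label)}-${message}-${color}`;
  return `![${label}](${url})`;
}

console.log(badgeMarkdown('flakiness', 7));
// ![flakiness](https://img.shields.io/badge/flakiness-7%25-red)
```

Passing the worst suite's rate keeps the badge aligned with the "reflects the worst suite" intent of this step.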
4. Gate the merge on the summarized rate
Fail the job when the parsed rate crosses your threshold so the summary is not merely informational. The same number can also feed quarantine logic, which ties this step into building auto-quarantine workflows.
    - name: Enforce flakiness gate
      if: always()
      run: |
        # rate.txt must hold an integer percentage; [ -gt ] cannot compare decimals.
        RATE=$(cat rate.txt)
        # Trade-off: a hard gate blocks merges (CI cost) but prevents flaky debt;
        # set the threshold above your current baseline to avoid blocking everything.
        if [ "$RATE" -gt 5 ]; then
          echo "::error::Flakiness ${RATE}% exceeds 5% threshold"
          exit 1
        fi
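The shell test `[ "$RATE" -gt 5 ]` only compares integers, so a rate such as `5.4` makes it error out rather than fail the gate cleanly. If your pipeline produces decimal rates, the comparison can move into Node. The sketch below is one way to do that; the `gate` function and its return shape are hypothetical, and the default of 5 mirrors the threshold used in this step.

```javascript
// Decide pass/fail from a rate that may be a decimal string such as "5.4".
function gate(rawRate, threshold = 5) {
  const rate = Number.parseFloat(rawRate);
  if (Number.isNaN(rate)) {
    // Treat an unreadable rate as a failure instead of silently passing.
    return { ok: false, message: `Could not parse flakiness rate: ${JSON.stringify(rawRate)}` };
  }
  return rate > threshold
    ? { ok: false, message: `Flakiness ${rate}% exceeds ${threshold}% threshold` }
    : { ok: true, message: `Flakiness ${rate}% is within ${threshold}% threshold` };
}

const result = gate('5.4');
// "::error::" is the workflow command that surfaces the message as an annotation.
console.log(result.ok ? result.message : `::error::${result.message}`);
// In CI, follow this with: if (!result.ok) process.exitCode = 1;
```

Failing on an unparseable value is a deliberate choice here: a missing or corrupt `rate.txt` should not let a merge through.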
Pitfalls
- Appending only on success. `if: success()` skips the runs that were flaky. Mitigation: always use `if: always()` on summary steps.
- Overwriting instead of appending. Using `>` truncates sections already written to the summary file. Mitigation: always use `>>` against `$GITHUB_STEP_SUMMARY`.
- Brittle jq on nested reporters. Deep Playwright trees break shell parsing. Mitigation: move computation into a tested Node script (step 2).
- Summary with no durable trend. The step summary cannot answer “is this getting worse?”. Mitigation: also stream the numbers to a metrics store.
- Badge cached by GitHub’s proxy. Shields badges are camo-cached and may lag. Mitigation: treat the badge as indicative; rely on the table for the exact rate.
Reliability targets
| Metric | Target | Notes |
|---|---|---|
| Summary write overhead | < 2 s | Appended after the test job |
| Flakiness gate threshold | 5% per suite | Fails the job above this |
| Summary coverage | 100% of runs | Guaranteed by if: always() |
| `main` pass rate | ≥ 99% | Stricter than feature branches |
| Time to see flaky data | 0 extra clicks | Rendered on the run page itself |
Frequently Asked Questions
Q: Why use $GITHUB_STEP_SUMMARY instead of a PR comment?
A: The step summary needs no token, no API call, and no comment-spam cleanup. It renders on the run page automatically and is the lowest-friction place to publish per-run reliability data.
Q: Can I include images or badges in the summary?
A: Yes. The summary renders GitHub-flavored markdown, so Shields-style badge images and emoji status icons work. Note that GitHub proxies and caches external images, so the badge value may lag the table slightly.
Q: How do I keep history if the summary is per-run?
A: The summary is intentionally per-run. Emit the same parsed numbers as metrics to a durable store for trends, and keep the summary for the immediate run-level snapshot.