Embedding Flakiness Summaries in

Parsed results become a markdown table appended to the step summary and rendered on the run page.

Root cause

Test reliability data is most actionable in the seconds after a run finishes, while the author still has context on what they changed. A reviewer who has to leave the pull request, open a separate dashboard, filter to the right branch, and correlate timestamps will simply not do it. GitHub Actions solves the location problem with the step summary: any text appended to the file path in $GITHUB_STEP_SUMMARY is rendered as markdown on the run’s summary tab and linked from the checks UI. The mechanism is a plain file append — there is no API, no auth, no rate limit — so it is the cheapest possible place to publish a per-run reliability snapshot.

The catch is that the summary is per-run: it lives with that workflow run and is not aggregated across history. That is the correct division of labor. Use the step summary for the immediate “is this run flaky?” question, and ship the same parsed numbers to a durable store (see streaming flakiness metrics to Datadog) for the cross-run trend.
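To make the "same parsed numbers, two destinations" idea concrete, here is a minimal sketch that turns per-suite counts into newline-delimited JSON records a metrics shipper could pick up. The function name `toMetricLines`, the metric name `ci.flaky_rate`, and the record fields are all hypothetical; they are not taken from Datadog's or any other vendor's schema.

```javascript
// Hypothetical helper: convert per-suite counts into newline-delimited JSON.
// The field names (metric, value, tags) are illustrative, not a vendor schema.
function toMetricLines(suites, context) {
  return suites
    .map((s) =>
      JSON.stringify({
        metric: 'ci.flaky_rate',
        // Fraction of flaky specs; empty suites report 0 instead of NaN.
        value: s.total === 0 ? 0 : s.flaky / s.total,
        tags: { suite: s.title, branch: context.branch, run_id: context.runId },
      }),
    )
    .join('\n');
}

// Example input: the same counts that feed the markdown table.
const metricLines = toMetricLines(
  [{ title: 'checkout', total: 40, flaky: 2 }],
  { branch: 'main', runId: '123' },
);
console.log(metricLines);
```

The step summary and the metrics export would then be two renderings of one parsed data set, so the table and the trend cannot disagree.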

Step-by-step fix

1. Append a markdown table to the step summary

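After the test step, parse the reporter JSON and append a table. Anything appended to the file at $GITHUB_STEP_SUMMARY is rendered as markdown. The example assumes a Playwright-style report in results.json, where a spec counts as flaky if any of its tests has more than one result (that is, it was retried).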

# .github/workflows/test.yml
- name: Write flakiness summary
  if: always()
  run: |
    echo "## Flakiness summary" >> "$GITHUB_STEP_SUMMARY"
    echo "" >> "$GITHUB_STEP_SUMMARY"
    echo "| Suite | Total | Flaky | Rate |" >> "$GITHUB_STEP_SUMMARY"
    echo "| --- | --- | --- | --- |" >> "$GITHUB_STEP_SUMMARY"
    # Trade-off: parsing in shell avoids a Node startup cost but is brittle;
    # for nested reporters prefer a small script (step 2) over jq one-liners.
    # any(...) counts each spec once even when several of its tests retried;
    # the select on specs length skips empty suites to avoid dividing by zero.
    jq -r '.suites[] | select((.specs|length) > 0)
      | ([.specs[] | select(any(.tests[]; (.results|length) > 1))] | length) as $flaky
      | "| \(.title) | \(.specs|length) | \($flaky) | \($flaky * 100 / (.specs|length) | floor)% |"' \
      results.json >> "$GITHUB_STEP_SUMMARY"

2. Build the summary from a script for richer tables

For nested reporters or computed columns, generate the markdown in Node and append the rendered string. This keeps the parsing testable.

// scripts/summary.js (run with: node scripts/summary.js)
// The script appends to the summary file itself, so no shell redirect is needed.
// It uses ES module syntax: set "type": "module" in package.json or rename to .mjs.
// Trade-off: a Node script adds ~300ms startup but is far less brittle than jq
// for deeply nested Playwright/Jest result trees.
import { readFileSync, appendFileSync } from 'node:fs';

const data = JSON.parse(readFileSync('results.json', 'utf8'));
const rows = data.suites.map((s) => {
  const total = s.specs.length;
  // A spec is flaky when any of its tests needed more than one attempt.
  const flaky = s.specs.filter((spec) => spec.tests.some((t) => t.results.length > 1)).length;
  // Keep the rate numeric for the comparisons below; guard empty suites against 0/0.
  const rate = total === 0 ? 0 : (flaky * 100) / total;
  const icon = rate > 5 ? '🔴' : rate > 0 ? '🟡' : '🟢';
  return `| ${s.title} | ${total} | ${flaky} | ${icon} ${rate.toFixed(1)}% |`;
});
const md = ['## Flakiness summary', '', '| Suite | Total | Flaky | Rate |', '| --- | --- | --- | --- |', ...rows, ''].join('\n');
appendFileSync(process.env.GITHUB_STEP_SUMMARY, md);
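The claim that a script keeps the parsing testable is easiest to see when the markdown generation is a pure function, separate from file I/O. The following is a sketch under that assumption: `escapeCell`, `isFlaky`, and `buildSummary` are hypothetical names, and the fixture mirrors the `suites[].specs[].tests[].results[]` shape used above. It also escapes `|` in suite titles, which would otherwise break the table row.

```javascript
// Escape characters that would break a markdown table cell.
function escapeCell(text) {
  return String(text).replace(/\|/g, '\\|').replace(/\r?\n/g, ' ');
}

// A spec counts as flaky when any of its tests needed more than one attempt.
function isFlaky(spec) {
  return spec.tests.some((t) => t.results.length > 1);
}

// Pure function: report object in, markdown string out. No file access.
function buildSummary(data) {
  const rows = data.suites.map((s) => {
    const total = s.specs.length;
    const flaky = s.specs.filter(isFlaky).length;
    const rate = total === 0 ? 0 : (flaky * 100) / total;
    return `| ${escapeCell(s.title)} | ${total} | ${flaky} | ${rate.toFixed(1)}% |`;
  });
  return [
    '## Flakiness summary',
    '',
    '| Suite | Total | Flaky | Rate |',
    '| --- | --- | --- | --- |',
    ...rows,
    '',
  ].join('\n');
}

// Tiny fixture: one spec passed first try, one needed a retry.
const fixture = {
  suites: [
    {
      title: 'login | sso',
      specs: [
        { tests: [{ results: [{}] }] },
        { tests: [{ results: [{}, {}] }] },
      ],
    },
  ],
};
console.log(buildSummary(fixture));
```

With this split, the CI entry point only has to read results.json, call the function, and append the result to the summary file, while the function itself can be exercised with fixtures in an ordinary unit test.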

3. Add run context and a badge link

Use the github context to stamp the summary with the branch, actor, and run, and link a Shields-style badge that reflects the worst suite.

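- name: Stamp run context
  if: always()
  env:
    # Pass context through env instead of inlining it in the script:
    # branch names are user-controlled and may contain shell metacharacters.
    REF_NAME: ${{ github.ref_name }}
    ACTOR: ${{ github.actor }}
    RUN_NUMBER: ${{ github.run_number }}
    RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
  run: |
    # Trade-off: embedding context inline avoids an extra artifact upload but
    # the values are frozen at run time; re-runs overwrite them.
    # rate.txt is expected to hold an integer percentage (test -gt rejects decimals).
    echo "_Branch \`${REF_NAME}\` · run [#${RUN_NUMBER}](${RUN_URL}) · by @${ACTOR}_" >> "$GITHUB_STEP_SUMMARY"
    COLOR=$([ "$(cat rate.txt)" -gt 5 ] && echo red || echo brightgreen)
    echo "![flakiness](https://img.shields.io/badge/flaky-$(cat rate.txt)%25-${COLOR})" >> "$GITHUB_STEP_SUMMARY"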
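The step above reads rate.txt, but the walkthrough does not show where that file comes from. One plausible way to derive it, sketched here with hypothetical helpers `worstSuiteRate` and `badgeUrl`, is to take the highest per-suite rate, round it up to a whole percent so the shell integer comparison works, and build the badge URL from the same number. The input shape (`title`, `total`, `flaky`) is an assumption.

```javascript
// Integer percentage of the worst suite, rounded up so that 5.1% does not
// slip under a 5% threshold. Multiplying before dividing keeps whole
// percentages exact (for example 7 * 100 / 100 is exactly 7).
function worstSuiteRate(suites) {
  const rates = suites.map((s) => (s.total === 0 ? 0 : (s.flaky * 100) / s.total));
  return Math.ceil(Math.max(0, ...rates));
}

// Static Shields badge URL; "%25" is the URL-encoded percent sign.
function badgeUrl(rate, threshold = 5) {
  const color = rate > threshold ? 'red' : 'brightgreen';
  return `https://img.shields.io/badge/flaky-${rate}%25-${color}`;
}

const worst = worstSuiteRate([
  { title: 'checkout', total: 40, flaky: 3 },
  { title: 'search', total: 50, flaky: 0 },
]);
console.log(worst, badgeUrl(worst));
```

In the summary script, the returned integer could be written to rate.txt (for example with `writeFileSync` from `node:fs`) so the later shell steps read a value that `[ ... -gt 5 ]` can compare.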

4. Gate the merge on the summarized rate

Fail the job when the parsed rate crosses your threshold so the summary is not merely informational. Feeding the same number into quarantine logic ties this into building auto-quarantine workflows.

- name: Enforce flakiness gate
  if: always()
  run: |
    RATE=$(cat rate.txt)
    # Trade-off: a hard gate blocks merges (CI cost) but prevents flaky debt;
    # set the threshold above your current baseline to avoid blocking everything.
    if [ "$RATE" -gt 5 ]; then
      echo "::error::Flakiness ${RATE}% exceeds 5% threshold"
      exit 1
    fi
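The shell gate compares a single number, while the targets in this guide are stated per suite. As a sketch of a per-suite variant, the following hypothetical functions (`suitesOverThreshold`, `gateMessages`) list every suite over the threshold and format one `::error::` annotation for each. The input shape and the 5% default are assumptions carried over from the examples above.

```javascript
// Return the suites whose flaky rate (in percent) exceeds the threshold.
function suitesOverThreshold(suites, thresholdPct = 5) {
  return suites
    .map((s) => ({
      title: s.title,
      rate: s.total === 0 ? 0 : (s.flaky * 100) / s.total,
    }))
    .filter((s) => s.rate > thresholdPct);
}

// Format GitHub Actions error annotations, one per offending suite.
function gateMessages(offenders, thresholdPct = 5) {
  return offenders.map(
    (s) =>
      `::error::Flakiness ${s.rate.toFixed(1)}% in "${s.title}" exceeds ${thresholdPct}% threshold`,
  );
}

const offenders = suitesOverThreshold([
  { title: 'checkout', total: 40, flaky: 3 },
  { title: 'search', total: 50, flaky: 1 },
]);
gateMessages(offenders).forEach((m) => console.log(m));
// In CI, a non-empty offenders list would be followed by: process.exitCode = 1;
```

Naming the offending suites in the annotation saves the author from cross-referencing the table to find out which suite tripped the gate.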

Pitfalls

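Appending only on success. A step without an if: condition, or with if: success(), is skipped once an earlier step has failed, which is exactly when the flakiness data matters most. Mitigation: always use if: always() on summary steps.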
Overwriting instead of appending. Using > truncates earlier sections from other steps. Mitigation: always use >> against $GITHUB_STEP_SUMMARY.
Brittle jq on nested reporters. Deep Playwright trees break shell parsing. Mitigation: move computation into a tested Node script (step 2).
Summary with no durable trend. The step summary cannot answer “is this getting worse?”. Mitigation: also stream the numbers to a metrics store.
Badge cached by GitHub’s proxy. Shields badges are camo-cached and may lag. Mitigation: treat the badge as indicative; rely on the table for the exact rate.

Reliability targets

Metric	Target	Notes
Summary write overhead	< 2 s	Appended after the test job
Flakiness gate threshold	5% per suite	Fails the job above this
Summary coverage	100% of runs	Guaranteed by `if: always()`
`main` pass rate	≥ 99%	Stricter than feature branches
Time to see flaky data	0 extra clicks	Rendered on the run page itself

Frequently Asked Questions

Q: Why use $GITHUB_STEP_SUMMARY instead of a PR comment? A: The step summary needs no token, no API call, and no comment-spam cleanup. It renders on the run page automatically and is the lowest-friction place to publish per-run reliability data.

Q: Can I include images or badges in the summary? A: Yes — the summary renders GitHub-flavored markdown, so Shields-style badge images and emoji status icons work. Note GitHub proxies and caches external images, so the badge value may lag the table slightly.

Q: How do I keep history if the summary is per-run? A: The summary is intentionally per-run. Emit the same parsed numbers as metrics to a durable store for trends, and keep the summary for the immediate run-level snapshot.