
Flake Evidence Collector: Auto-Rerun a Suspect Test N Times to Confirm Flakiness

On a single test failure, dispatches the same test in isolation several times to measure its real failure rate.

Category: Engineering
Engine: sim
Difficulty: advanced
Trigger: webhook
Steps: 5
Setup: ~25 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: GitHub webhook on a single test failure (GitHub)
  • Action: Dispatch N isolated reruns of the test (GitHub)
  • Logic: Tally rerun pass/fail ratio; classify hard-fail vs. flaky
  • Action: Summarize flakiness with rerun evidence (OpenAI)
  • Output: Open Linear flake ticket + draft skip PR with evidence (Linear)
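
The trigger step only makes sense if the incoming webhook can be trusted. Below is a minimal sketch of that check, assuming the delivery is signed with a shared secret and carries GitHub's X-Hub-Signature-256 header. The function name is illustrative and not part of this workflow's configuration.

```python
import hashlib
import hmac


def is_valid_github_signature(secret: bytes, body: bytes, header: str) -> bool:
    """Check a raw webhook body against its X-Hub-Signature-256 header value."""
    if not header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so a mismatch does not leak
    # how many leading characters were correct.
    return hmac.compare_digest(expected.encode("utf-8"), header.encode("utf-8"))
```

The check must run on the raw request bytes, before any JSON parsing, because re-serializing the payload can change the bytes that were signed.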

What it does

A single red run cannot tell a real regression from a flake. This agent reruns the suspect test in isolation several times, records the pass/fail pattern, and uses that evidence to decide between escalation and quarantine. A test that fails every rerun is a hard failure and is escalated; one that fails some but not all reruns is confirmed flaky and quarantined with the evidence attached.
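
The decision rule described here can be sketched as a small tally function. This is an illustration under stated assumptions, not the workflow's actual logic step: outcomes are assumed to arrive as the strings "pass" and "fail", and the "not-reproduced" verdict is a hypothetical label for the case the text does not cover, where every rerun passes.

```python
from collections import Counter
from typing import Iterable


def classify_reruns(outcomes: Iterable[str]) -> dict:
    """Tally rerun outcomes ("pass" or "fail") and label the test."""
    counts = Counter(outcomes)
    passed, failed = counts["pass"], counts["fail"]
    total = passed + failed  # any other outcome string is ignored
    if total == 0:
        raise ValueError("no rerun outcomes to classify")
    if failed == total:
        verdict = "hard-fail"  # failed every rerun: escalate
    elif failed > 0:
        verdict = "flaky"  # failed some but not all reruns: quarantine
    else:
        # Assumption: the source does not say what happens when every
        # rerun passes, so this sketch reports it separately.
        verdict = "not-reproduced"
    return {
        "verdict": verdict,
        "passed": passed,
        "failed": failed,
        "failure_rate": failed / total,
    }
```

For example, classify_reruns(["fail", "pass", "fail", "pass", "pass"]) yields a "flaky" verdict with 3 passes and 2 failures, which is the pass/fail record that would be attached as evidence.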

When to use it

Use it when you want proof before quarantining and your CI lets you dispatch a targeted test run on demand. It replaces guesswork with a measured failure rate instead of an inference from one failure.
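
How many reruns make a useful sample depends on how rare the flake is. As a rough guide, and assuming each rerun fails independently with the same probability p (an assumption that shared state or load-dependent flakes may violate), the chance that N reruns show at least one failure is 1 - (1 - p)^N. The helper names below are illustrative.

```python
import math


def detection_probability(p: float, n: int) -> float:
    """Chance that n independent reruns include at least one failure."""
    return 1.0 - (1.0 - p) ** n


def reruns_needed(p: float, confidence: float = 0.95) -> int:
    """Smallest n for which detection_probability(p, n) >= confidence."""
    if not 0.0 < p < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))
```

Under these assumptions, a test that fails 10% of the time needs 29 reruns for a 95% chance of failing at least once, while a test that fails half the time needs only 5. A small N can therefore confirm a frequent flake but may miss a rare one.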

How it works

  1. A GitHub webhook fires on a single test failure.
  2. The flow dispatches N isolated reruns of just that test via the GitHub API and waits for results.
  3. A logic step tallies the pass/fail ratio across the reruns.
  4. If the test failed every time, it opens a GitHub regression issue and stops.
  5. If it failed intermittently, an OpenAI step writes a flake summary with the rerun evidence.
  6. It opens a Linear flake ticket and a draft skip PR, attaching the rerun pass/fail record.
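
The dispatch in step 2 can be sketched against GitHub's "create a workflow dispatch event" REST endpoint. This is one possible approach, not a description of how the flow is built. It assumes the repository has a workflow with a workflow_dispatch trigger that accepts a single-test input; the workflow file name and the "test_id" input are hypothetical. The functions only build the requests, so nothing is sent until they are passed to urllib.request.urlopen.

```python
import json
import urllib.request


def build_dispatch_request(owner: str, repo: str, workflow_file: str,
                           ref: str, test_id: str,
                           token: str) -> urllib.request.Request:
    """Build (but do not send) one workflow_dispatch call for a single test."""
    url = (f"https://api.github.com/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow_file}/dispatches")
    # "test_id" is a hypothetical input name; it must match whatever
    # input the target workflow actually declares.
    payload = {"ref": ref, "inputs": {"test_id": test_id}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def build_rerun_requests(n: int, **kwargs) -> list:
    """One identical dispatch request per isolated rerun."""
    return [build_dispatch_request(**kwargs) for _ in range(n)]
```

Waiting for results is a separate step not shown here: the flow would still need to find the runs it started and poll them until each one completes before tallying.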

Set it up

What you configure once, before turning it on.

  1. Connect GitHub: Repos, issues, pull requests, actions.
  2. Connect OpenAI: Models, embeddings, files.
  3. Connect Linear: Issues, projects, cycles, triage.
  4. Set each agent's model: Models are left unset so you pick the tier, either fast and cheap or top-quality.
  5. Tune it to your data: Edit the prompts, filters, and field mappings so it matches how your team works.
  6. Test, then turn it on: Run once against a sample, confirm the output, then enable the trigger.
