Smart CI Rerun with Escalation to PagerDuty

On a CI failure, automatically reruns only the failed jobs once; if they pass it labels the failure flaky, but if they fail again on an unchanged commit it pages the on-call…

Category: Engineering
Engine: sim
Difficulty: advanced
Trigger: event
Steps: 5
Setup: ~25 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: GitHub check failure on protected branch (GitHub)
  • Action: Rerun only failed jobs (GitHub)
  • Logic: Compare rerun result: flake vs regression
  • Action: Label run as flaky on pass (GitHub)
  • Output: Page on-call via PagerDuty on repeat failure (PagerDuty)

What it does

This workflow distinguishes flakes from real breakage automatically. It reruns failed jobs a single time. A pass on rerun is logged as a flake and the build is unblocked; a second consecutive failure on the same commit is treated as a genuine regression and escalated to on-call so a person looks immediately, with no wasted manual reruns.
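The decision rule described above can be sketched as a small pure function. This is a minimal illustration, not the workflow's actual implementation: the names `Attempt`, `Verdict`, and `classify` are hypothetical, and the `INCONCLUSIVE` branch (commit changed between attempts, or the rerun was cancelled or skipped) is an assumption about how to handle cases the description does not cover.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    FLAKY = "flaky"                # rerun passed: label the run and unblock
    REGRESSION = "regression"      # failed twice on the same commit: page on-call
    INCONCLUSIVE = "inconclusive"  # assumption: anything else is not paged


@dataclass(frozen=True)
class Attempt:
    head_sha: str    # commit the jobs ran against
    conclusion: str  # e.g. "success", "failure", "cancelled"


def classify(first: Attempt, rerun: Attempt) -> Verdict:
    """Compare the original failed attempt with its single rerun."""
    if first.conclusion != "failure":
        raise ValueError("the flow only starts from a failed attempt")
    if rerun.head_sha != first.head_sha:
        # The commit changed, so the two attempts are not comparable.
        return Verdict.INCONCLUSIVE
    if rerun.conclusion == "success":
        return Verdict.FLAKY
    if rerun.conclusion == "failure":
        return Verdict.REGRESSION
    return Verdict.INCONCLUSIVE
```

Keeping the rule free of network calls makes it easy to unit-test before the trigger is enabled; only the `REGRESSION` verdict should reach PagerDuty.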

When to use it

Use this on protected branches where a red build is either a flake or a true regression and you need fast, automatic disambiguation without an engineer babysitting the pipeline.

How it works

  1. A GitHub check_run failure on a protected branch triggers the flow.
  2. An action reruns only the failed jobs via the GitHub API.
  3. A logic step waits for the rerun result and compares against the prior attempt.
  4. If it passed, an action labels the run flaky and records it.
  5. If it failed again, the output creates a PagerDuty incident for the on-call engineer.
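For readers wiring the rerun and paging steps by hand, the two outbound calls might look roughly like the sketch below. It assumes the CI runs on GitHub Actions (so the REST endpoint for rerunning failed jobs of a workflow run applies) and that paging goes through the PagerDuty Events API v2; the template's built-in connectors may do this differently. The helper names, the `dedup_key` format, and the `critical` severity are illustrative choices, not part of the original workflow. The functions only build the requests and send nothing.

```python
import json
from urllib.request import Request

GITHUB_API = "https://api.github.com"
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_rerun_request(owner: str, repo: str, run_id: int, token: str) -> Request:
    """Step 2: ask GitHub Actions to rerun only the failed jobs of a run."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/actions/runs/{run_id}/rerun-failed-jobs"
    return Request(
        url,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def build_page_request(routing_key: str, repo: str, head_sha: str, run_url: str) -> Request:
    """Step 5: build a PagerDuty Events API v2 trigger event."""
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Hypothetical key format: one incident per repo and commit,
        # so a duplicate webhook does not page twice.
        "dedup_key": f"ci-regression:{repo}:{head_sha}",
        "payload": {
            "summary": f"CI failed twice on {repo}@{head_sha[:7]}",
            "source": repo,
            "severity": "critical",  # assumption: pick what fits your policy
        },
        "links": [{"href": run_url, "text": "Failed workflow run"}],
    }
    return Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Either request can be sent with `urllib.request.urlopen(req)`. Building it separately from sending it lets the payload be inspected during a test run against a sample before the trigger is turned on.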

Set it up

What you configure once, before turning it on.

  1. Connect GitHub. Repos, issues, pull requests, actions.
  2. Connect PagerDuty. Incidents, on-call, escalations.
  3. Set each agent's model. We leave models unset so you pick the tier: fast and cheap, or top-quality.
  4. Tune it to your data. Edit the prompts, filters, and field mappings so it matches how your team works.
  5. Test, then turn it on. Run once against a sample, confirm the output, then enable the trigger.
