
Page on-call when flaky failures spike into a suite meltdown

Monitors the rate of newly quarantined specs and total CI failure volume; when both spike past a threshold in a short window, it pages the on-call engineer and posts a meltdown…

Category: Engineering
Engine: sim
Difficulty: intermediate
Trigger: webhook
Steps: 5
Setup: ~15 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: Webhook: spec quarantined event (HTTP webhook)
  • Action: Query trailing-window quarantine and failure counts (Postgres)
  • Logic: Gate: both metrics exceed meltdown thresholds
  • Action: Trigger PagerDuty incident for on-call (PagerDuty)
  • Output: Post meltdown alert to Slack (Slack)

What it does

Distinguishes ordinary background flakiness from a sudden systemic failure (a bad dependency bump, a broken shared fixture, or infra degradation) that masquerades as a flood of "flaky" tests. When the quarantine rate and overall failure volume spike together, the workflow escalates to the on-call engineer instead of silently quarantining everything.
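
The "both metrics spike together" rule can be sketched as a small predicate. This is a minimal illustration, not the workflow's actual implementation: the names `MeltdownThresholds` and `is_meltdown` and the default threshold values are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeltdownThresholds:
    # Hypothetical defaults; tune them to your suite's normal flake rate.
    max_new_quarantines: int = 10
    max_ci_failures: int = 50


def is_meltdown(new_quarantines: int, ci_failures: int,
                thresholds: MeltdownThresholds = MeltdownThresholds()) -> bool:
    """Trip only when BOTH metrics exceed their thresholds in the same window."""
    return (new_quarantines > thresholds.max_new_quarantines
            and ci_failures > thresholds.max_ci_failures)


# Background flakiness: a few quarantines, normal failure volume.
print(is_meltdown(new_quarantines=3, ci_failures=20))    # False
# Many quarantines but failure volume stays low: still not a meltdown.
print(is_meltdown(new_quarantines=25, ci_failures=20))   # False
# Both spike together: escalate.
print(is_meltdown(new_quarantines=25, ci_failures=120))  # True
```

Requiring both conditions keeps a single noisy metric from paging anyone, at the cost of missing failures that show up in only one of the two counts.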

When to use it

Use this as a safety net on top of an automated quarantine program, so the system never quietly hides a real outage by quarantining dozens of specs at once.

How it works

  1. A webhook fires each time a spec is quarantined, carrying the running counts.
  2. The flow queries Postgres for the number of newly quarantined specs and total CI failures in the trailing time window.
  3. A logic gate checks whether both the quarantine rate and failure volume exceed their meltdown thresholds simultaneously.
  4. If the gate trips, it triggers a PagerDuty incident for the on-call engineer.
  5. It also posts a meltdown alert to Slack summarizing the spike and the affected specs so the team can converge immediately.
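
The steps above can be sketched end to end in Python. Everything here is an assumption made for illustration: the table names (`quarantined_specs`, `ci_failures`), the column names, the 15-minute window, the thresholds of 10 and 50, and the `source` string are hypothetical, and an in-memory SQLite database stands in for Postgres so the example runs without a server. The PagerDuty body follows the Events API v2 trigger shape and the Slack body follows the incoming-webhook `text` shape; the sketch only builds the payloads and does not send them.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def trailing_counts(conn, now, window):
    """Count quarantines and CI failures recorded at or after now - window."""
    cutoff = (now - window).timestamp()
    (quarantined,) = conn.execute(
        "SELECT COUNT(*) FROM quarantined_specs WHERE quarantined_at >= ?",
        (cutoff,),
    ).fetchone()
    (failures,) = conn.execute(
        "SELECT COUNT(*) FROM ci_failures WHERE failed_at >= ?",
        (cutoff,),
    ).fetchone()
    return quarantined, failures


def pagerduty_event(routing_key, quarantined, failures, window_minutes):
    """Build an Events API v2 'trigger' body; the caller would POST it as JSON."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": (f"CI meltdown: {quarantined} specs quarantined and "
                        f"{failures} failures in the last {window_minutes} min"),
            "source": "ci-quarantine-monitor",  # hypothetical source name
            "severity": "critical",
        },
    }


def slack_message(quarantined, failures, specs):
    """Build an incoming-webhook body listing the affected specs."""
    listed = "\n".join(f"- {spec}" for spec in specs)
    return {"text": (f"Suite meltdown: {quarantined} new quarantines, "
                     f"{failures} CI failures.\nAffected specs:\n{listed}")}


# Sample data: 12 recent quarantines and 60 recent failures, plus one old row each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quarantined_specs (spec TEXT, quarantined_at REAL)")
conn.execute("CREATE TABLE ci_failures (spec TEXT, failed_at REAL)")
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = (now - timedelta(minutes=5)).timestamp()
old = (now - timedelta(hours=3)).timestamp()
conn.executemany(
    "INSERT INTO quarantined_specs VALUES (?, ?)",
    [(f"spec_{i}.rb", recent) for i in range(12)] + [("old_spec.rb", old)],
)
conn.executemany(
    "INSERT INTO ci_failures VALUES (?, ?)",
    [(f"spec_{i % 12}.rb", recent) for i in range(60)] + [("old_spec.rb", old)],
)

window = timedelta(minutes=15)
quarantined, failures = trailing_counts(conn, now, window)
print(quarantined, failures)  # 12 60

event = None
message = None
if quarantined > 10 and failures > 50:  # hypothetical meltdown thresholds
    specs = [row[0] for row in conn.execute(
        "SELECT spec FROM quarantined_specs WHERE quarantined_at >= ? ORDER BY spec",
        ((now - window).timestamp(),),
    )]
    event = pagerduty_event("YOUR_ROUTING_KEY", quarantined, failures, 15)
    message = slack_message(quarantined, failures, specs)
    print(event["payload"]["summary"])
```

Against a real Postgres instance the same queries would typically use the driver's own placeholder style and a native timestamp column rather than epoch seconds.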

Set it up

What you configure once, before turning it on.

  1. Connect HTTP webhook: Trigger any URL on agent actions.
  2. Connect Postgres: Any Postgres URL (query, write, migrate).
  3. Connect PagerDuty: Incidents, on-call, escalations.
  4. Connect Slack: Channels, DMs, threads, mentions.
  5. Set each agent's model: We leave models unset so you pick the tier (fast and cheap, or top quality).
  6. Tune it to your data: Edit the prompts, filters, and field mappings so it matches how your team works.
  7. Test, then turn it on: Run once against a sample, confirm the output, then enable the trigger.
