Escalate flaky-test spikes that block deploys to PagerDuty

When Datadog detects a sudden surge in flaky failures on the main branch, the workflow checks whether deploys are blocked; if so, it pages the on-call engineer via PagerDuty and posts a war-room…

Category: DevOps
Engine: sim
Difficulty: advanced
Trigger: event
Steps: 6
Setup: ~25 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: Datadog monitor, main-branch flake spike (Datadog)
  • Action: Check deploy pipeline blocked status
  • Logic: Deploys blocked and above incident threshold?
  • Action: Create PagerDuty incident for CI on-call (PagerDuty)
  • Action: Open Slack war-room thread with top offenders (Slack)
  • Output: Link incident and Datadog dashboard in thread (Slack)

What it does

Detects when flakiness crosses from an annoyance into a deploy-blocking incident. A Datadog monitor on the main-branch flake rate triggers the flow, which confirms that the release pipeline is actually stalled and then escalates: it pages on-call through PagerDuty and opens a Slack war-room thread listing the tests driving the spike, so the team can quarantine them fast.
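As a rough illustration of the "tests driving the spike" list, the sketch below counts failures per test and formats a plain-text summary for the war-room thread. The input shape (a flat list of failed test names), the function names, and the message wording are all assumptions made for this example, not the workflow's actual implementation.

```python
from collections import Counter


def top_offenders(failed_tests, limit=5):
    """Return (test_name, failure_count) pairs, most failures first."""
    counts = Counter(failed_tests)
    # Sort by count descending, then by name, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


def format_war_room_message(offenders):
    """Build the text of the opening war-room message (hypothetical wording)."""
    lines = ["Flaky-test spike on main is blocking deploys. Top offenders:"]
    for name, count in offenders:
        lines.append(f"- {name}: {count} failures")
    return "\n".join(lines)
```

Ranking by failure count keeps the thread focused on the few tests worth quarantining first; the limit of five is an arbitrary default.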

When to use it

Use this when a burst of new flakes can wedge your deploy pipeline and you need humans engaged in minutes, not at the next standup. It separates routine flake noise from true delivery-blocking events.

How it works

  1. A Datadog monitor alert fires on a main-branch flake-rate spike.
  2. The flow checks the deploy pipeline status to confirm releases are blocked, not merely slowed.
  3. A branch decides severity: page only if deploys are blocked and the spike exceeds the incident threshold.
  4. It creates a PagerDuty incident assigned to the CI on-call rotation.
  5. It opens a Slack war-room thread with the top offending tests and their failure counts.
  6. The final step links the PagerDuty incident and Datadog dashboard into the Slack thread for the responder.
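The branch in step 3 and the page in step 4 can be sketched as below. This is a minimal illustration under stated assumptions: the `SpikeAlert` fields, the 0.15 threshold, and the `source` string are hypothetical, and the event dictionary follows the general shape of a PagerDuty Events API v2 trigger event, which may not be how the workflow's own PagerDuty connection creates incidents.

```python
from dataclasses import dataclass


@dataclass
class SpikeAlert:
    flake_rate: float      # fraction of main-branch runs that flaked, 0.0 to 1.0
    deploys_blocked: bool  # result of the deploy pipeline status check
    dashboard_url: str     # link back to the Datadog dashboard


INCIDENT_THRESHOLD = 0.15  # hypothetical value; tune to your own baseline


def should_page(alert, threshold=INCIDENT_THRESHOLD):
    """Page only if deploys are blocked AND the spike exceeds the threshold."""
    return alert.deploys_blocked and alert.flake_rate > threshold


def build_pagerduty_event(alert, routing_key):
    """Build a trigger event shaped like a PagerDuty Events API v2 payload."""
    return {
        "routing_key": routing_key,  # integration key for the CI on-call service
        "event_action": "trigger",
        "payload": {
            "summary": (
                "Flaky-test spike blocking deploys "
                f"({alert.flake_rate:.0%} flake rate on main)"
            ),
            "source": "datadog-flake-monitor",  # hypothetical source name
            "severity": "critical",
        },
        "links": [{"href": alert.dashboard_url, "text": "Datadog dashboard"}],
    }
```

Requiring both conditions is what keeps routine flake noise from paging anyone: a high flake rate with deploys still flowing, or a blocked pipeline with a normal flake rate, falls through without an incident.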

Set it up

What you configure once, before turning it on.

  1. Connect Datadog: metrics, traces, log search.
  2. Connect PagerDuty: incidents, on-call, escalations.
  3. Connect Slack: channels, DMs, threads, mentions.
  4. Set each agent's model: we leave models unset so you pick the tier, either fast and cheap or top-quality.
  5. Tune it to your data: edit the prompts, filters, and field mappings so it matches how your team works.
  6. Test, then turn it on: run once against a sample, confirm the output, then enable the trigger.
