Datadog CI Visibility Flaky Spike to PagerDuty

Subscribes to Datadog CI Visibility flaky-test events and, when a test's flaky rate spikes past a guardrail, pages the owning team via PagerDuty and files a GitHub issue…

Category: Engineering
Engine: sim
Difficulty: intermediate
Trigger: event
Steps: 4
Setup: ~15 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: Datadog CI Visibility flaky event (Datadog)
  • Logic: Compare spike to on-call guardrail
  • Action: Page owning team on breach (PagerDuty)
  • Output: File GitHub issue with Datadog trace link (GitHub)

What it does

This workflow reacts to flaky-test signals from Datadog CI Visibility. When Datadog reports a test whose flaky rate has spiked beyond a guardrail, the bot decides whether it crosses the on-call threshold, pages the owning team through PagerDuty if so, and always files a GitHub tracking issue linking back to the Datadog test page and recent failed traces.

When to use it

Use this when flakiness in critical paths needs an immediate human, not just a backlog ticket. It pages on-call for high-impact spikes and files a tracked issue for every spike, paged or not.

How it works

  1. A Datadog CI Visibility webhook event fires for a flaky test.
  2. The bot reads the test's flaky rate, owning team tag, and recent failure traces from the event payload.
  3. A decision step checks the spike against the guardrail and the team's on-call policy.
  4. If the spike breaches the on-call threshold, the bot triggers a PagerDuty incident routed to the owning team's service.
  5. Either way, the bot creates a GitHub issue with the Datadog deep link, flaky rate, and trace references for the team to triage.
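
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the workflow's actual implementation: the `FlakyEvent` field names, the strict `>` comparison, the `flaky-test` label, and the `source` string are all assumptions made for the example. The PagerDuty body follows the general shape of an Events API v2 trigger event, and the GitHub body uses the `title`, `body`, and `labels` fields of the create-issue endpoint. No network calls are made; the functions only build the payloads.

```python
from dataclasses import dataclass


@dataclass
class FlakyEvent:
    """Fields read from the event payload (all names are hypothetical)."""
    test_name: str
    flaky_rate: float       # fraction of recent runs that flaked, 0.0 to 1.0
    team: str               # owning team tag
    test_url: str           # deep link to the Datadog test page
    trace_urls: list[str]   # links to recent failed traces


def should_page(event: FlakyEvent, on_call_threshold: float) -> bool:
    """Assumption: a breach means the rate is strictly above the threshold."""
    return event.flaky_rate > on_call_threshold


def pagerduty_event(event: FlakyEvent, routing_key: str) -> dict:
    """Build a PagerDuty Events API v2 trigger body for the team's service."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"Flaky spike: {event.test_name} at {event.flaky_rate:.0%}",
            "source": "datadog-ci-visibility",  # hypothetical source name
            "severity": "error",
        },
        "links": [{"href": event.test_url, "text": "Datadog test page"}],
    }


def github_issue(event: FlakyEvent) -> dict:
    """Build the JSON body for GitHub's create-issue endpoint."""
    traces = "\n".join(f"- {url}" for url in event.trace_urls)
    body = (
        f"Flaky rate: {event.flaky_rate:.0%}\n"
        f"Owning team: {event.team}\n"
        f"Datadog test page: {event.test_url}\n\n"
        f"Recent failed traces:\n{traces}"
    )
    return {
        "title": f"Flaky test: {event.test_name}",
        "body": body,
        "labels": ["flaky-test"],  # hypothetical label
    }


def plan_actions(
    event: FlakyEvent, on_call_threshold: float, routing_key: str
) -> list[tuple[str, dict]]:
    """Page only on breach; always file the tracking issue."""
    actions: list[tuple[str, dict]] = []
    if should_page(event, on_call_threshold):
        actions.append(("pagerduty", pagerduty_event(event, routing_key)))
    actions.append(("github", github_issue(event)))
    return actions
```

Keeping the decision (`should_page`) separate from the payload builders makes it easy to test the threshold logic against sample events before any real page is sent.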

Set it up

What you configure once, before turning it on.

  1. Connect Datadog: metrics, traces, log search.
  2. Connect PagerDuty: incidents, on-call, escalations.
  3. Connect GitHub: repos, issues, pull requests, actions.
  4. Set each agent's model: we leave models unset so you pick the tier, fast and cheap or top-quality.
  5. Tune it to your data: edit the prompts, filters, and field mappings so it matches how your team works.
  6. Test, then turn it on: run once against a sample, confirm the output, then enable the trigger.
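
As an illustration of the field-mapping and sample-run steps, the sketch below maps dotted paths in a webhook payload to the fields the workflow reads, then runs the mapping against a sample. The mapping format, the paths, and the sample payload are all hypothetical; the real payload shape depends on how your Datadog webhook is configured.

```python
from typing import Any

# Hypothetical mapping from workflow field names to dotted paths in the
# webhook payload. Adjust the paths to match your own payload.
FIELD_MAP = {
    "test_name": "test.name",
    "flaky_rate": "test.flaky_rate",
    "team": "tags.team",
}


def dig(payload: dict[str, Any], dotted_path: str) -> Any:
    """Follow a dotted path through nested dicts; raises if a key is missing."""
    value: Any = payload
    for key in dotted_path.split("."):
        value = value[key]
    return value


def normalize(payload: dict[str, Any]) -> dict[str, Any]:
    """Extract every mapped field from the raw payload."""
    return {field: dig(payload, path) for field, path in FIELD_MAP.items()}


# Made-up sample event for a dry run before enabling the trigger.
sample = {
    "test": {"name": "test_checkout", "flaky_rate": 0.42},
    "tags": {"team": "payments"},
}
fields = normalize(sample)
```

A missing key fails loudly here rather than silently producing an empty field, which is usually what you want while confirming the output against a sample.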
