AI AGENTS
Telemetry Cost-Spike Triage from PagerDuty Incident
When a telemetry-budget PagerDuty incident fires, an agent correlates the spike to the responsible Datadog metric or Honeycomb dataset and posts a root-cause triage…
How it runs
The automated pipeline, trigger to output.
- Trigger: PagerDuty telemetry-cost incident fires (PagerDuty)
- Action: Read incident start time and query context (PagerDuty)
- Action: Query Datadog and Honeycomb around the spike window (Datadog)
- Logic: Pick most likely culprit and draft drop rule
- Output: Append root-cause note and proposed rule to incident (PagerDuty)
What it does
When an observability-cost budget alarm escalates to PagerDuty, this agent does the first triage for the on-call engineer. It looks at the moment the spike began, finds the Datadog metric or Honeycomb dataset whose cardinality or volume jumped, and writes a plain-language root cause plus a candidate drop or aggregate rule, all attached to the incident before a human even opens their laptop.
When to use it
Use it when telemetry-cost spikes page someone and the on-call engineer spends the first twenty minutes figuring out which service shipped a noisy new tag. It compresses that triage into a single incident note.
How it works
1. A PagerDuty incident with a telemetry-cost service tag triggers the run.
2. The agent reads the incident's start time and any included query context.
3. It queries both Datadog and Honeycomb around that window to locate the metric or dataset with the abnormal cardinality or volume jump.
4. A logic step picks the single most likely culprit and drafts a targeted drop or aggregate recommendation.
5. It appends the root-cause summary and proposed rule as a note on the PagerDuty incident.
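As an illustration of steps 2 to 4, here is a minimal Python sketch of how the window, the culprit pick, and the draft note could fit together. Everything in it is an assumption made for illustration: the `Candidate` shape, the 30-minute padding, and the ratio-based ranking are hypothetical and not the workflow's actual logic, and the before/after numbers are presumed to have already been fetched from Datadog and Honeycomb.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Candidate:
    """Hypothetical summary of one metric or dataset around the spike."""
    source: str    # "datadog" or "honeycomb"
    name: str      # metric or dataset name
    before: float  # cardinality or volume before the spike window
    after: float   # cardinality or volume inside the spike window


def spike_window(started_at: str, minutes: int = 30) -> tuple[datetime, datetime]:
    """Return a window centred on the incident start time (ISO 8601 string)."""
    start = datetime.fromisoformat(started_at)
    pad = timedelta(minutes=minutes)
    return start - pad, start + pad


def jump_ratio(c: Candidate) -> float:
    """Relative growth; the floor of 1.0 avoids dividing by zero for new series."""
    return c.after / max(c.before, 1.0)


def pick_culprit(candidates: list[Candidate]) -> Candidate:
    """Pick the single candidate with the largest relative jump."""
    return max(candidates, key=jump_ratio)


def draft_note(c: Candidate) -> str:
    """Draft a plain-language note; the rule is a suggestion for a human to review."""
    return (
        f"Likely culprit: {c.source} '{c.name}' grew {jump_ratio(c):.1f}x "
        f"({c.before:.0f} -> {c.after:.0f}). "
        "Proposed rule: drop or aggregate the newly added tag on this series; "
        "review before applying."
    )
```

Ranking by relative jump rather than absolute volume is one possible design choice: it favours a small series that suddenly exploded over a large series that grew slightly, which matches the "noisy new tag" scenario described above.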
Set it up
What you configure once, before turning it on.
1. Connect PagerDuty: Incidents, on-call, escalations.
2. Connect Datadog: Metrics, traces, log search.
3. Connect Honeycomb: Distributed traces and queries.
4. Set each agent's model: We leave models unset so you pick the tier, fast and cheap or top quality.
5. Tune it to your data: Edit the prompts, filters, and field mappings so it matches how your team works.
6. Test, then turn it on: Run once against a sample, confirm the output, then enable the trigger.
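To make the tuning and test steps concrete, the sketch below shows the kind of settings a team might adjust and a dry-run switch for the first sample run. The `TriageSettings` fields, their defaults, and the helper functions are hypothetical names chosen for illustration; the product's real configuration surface may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageSettings:
    """Hypothetical knobs; names and defaults are assumptions, not product settings."""
    service_tag: str = "telemetry-cost"  # incidents must carry this tag to trigger a run
    window_minutes: int = 30             # how far around the start time to query
    min_jump_ratio: float = 2.0          # growth factor that counts as abnormal
    dry_run: bool = True                 # keep True while testing against a sample


def should_run(incident_tags: list[str], settings: TriageSettings) -> bool:
    """Filter: only incidents tagged as telemetry-cost start the workflow."""
    return settings.service_tag in incident_tags


def is_abnormal(before: float, after: float, settings: TriageSettings) -> bool:
    """Treat a series as abnormal once it grows by at least min_jump_ratio."""
    return after >= settings.min_jump_ratio * max(before, 1.0)


def deliver(note: str, settings: TriageSettings) -> str:
    """In dry-run mode, mark the note so it is reviewed instead of posted."""
    return f"[dry run] {note}" if settings.dry_run else note
```

Running once with `dry_run=True` against a past incident gives a note to inspect before the trigger is enabled.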
More AI Agents workflows
Custom Metrics Cardinality Spike Pager
A webhook from a Datadog monitor fires when custom-metric cardinality jumps; an agent pinpoints the offending metric and tag and estimates the added cost.
Sentry-to-Confluence Runbook Updater
When a Sentry issue is resolved, the agent finds the matching Confluence runbook page and proposes an inline update with the verified fix.
Stale Doc-PR Chaser for Runbook Gaps
On a daily schedule, the agent finds runbook doc PRs that were opened from resolved incidents but never reviewed and summarizes what each one fixes.
Resolved Incident to Public Troubleshooting Doc
For customer-facing errors resolved in Sentry, the agent drafts a sanitized troubleshooting entry and opens a PR to your ReadMe documentation.
On-Call Runbook Gap Closer: Resolved Sentry Issues to Doc PRs
An agent reads each newly resolved Sentry issue, compares the actual fix against your existing runbook, and opens a GitHub PR adding the missing remediation steps.
Weekly On-Call Doc-Gap Digest
Each week, the agent reviews every Sentry issue resolved in the last 7 days and ranks the ones whose runbook coverage is missing or thin.
Run it inside a business
This workflow drops into a full company template. Import the org, and this is one of the playbooks its agents run.

