AI AGENTS
Nightly Health Sweep with Morning Fix-Approval Queue
On a nightly schedule, an agent sweeps Datadog and PagerDuty for degraded-but-not-paging conditions and drafts a remediation for each.
How it runs
The automated pipeline, trigger to output.
- Trigger: Nightly schedule fires
- Action: Pull warning monitors and resolved incidents (Datadog)
- Action: Agent drafts remediation per signal cluster
- Logic: Rank by risk and drop self-healed items
- Action: Post batched approval queue to Slack (Slack)
- Output: Execute approved fixes and post summary (Shell)
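To make the clustering and logic steps concrete, here is a minimal sketch in Python. It is not the template's actual implementation. The `Signal` fields, the 1-to-5 `severity` scale, the `still_active` flag, and grouping by service name are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Signal:
    # All fields are hypothetical; map them from your own monitor/incident data.
    source: str         # e.g. "datadog" or "pagerduty"
    service: str        # assumed grouping key for clustering
    title: str
    severity: int       # assumed risk score, 1 (low) to 5 (high)
    still_active: bool  # False if the condition has already cleared on its own


def cluster_by_service(signals):
    """Group related signals; here 'related' simply means 'same service'."""
    clusters = defaultdict(list)
    for signal in signals:
        clusters[signal.service].append(signal)
    return dict(clusters)


def build_queue(signals):
    """Drop fully self-healed clusters, then rank the rest by highest risk."""
    items = []
    for service, group in cluster_by_service(signals).items():
        if not any(s.still_active for s in group):
            continue  # every signal in this cluster recovered: nothing to fix
        items.append({
            "service": service,
            "risk": max(s.severity for s in group),
            "signals": [s.title for s in group],
        })
    return sorted(items, key=lambda item: item["risk"], reverse=True)
```

Treating a cluster as self-healed only when every signal in it has cleared is one possible policy. A stricter filter could drop individual signals instead.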
What it does
Proactively catches the slow-burn problems that never page — creeping disk usage, a flapping monitor, a stale auto-resolved incident — and turns them into a tidy morning checklist. Each item comes with a proposed fix the engineer can approve or skip in one place.
When to use it
Use this to stop low-grade issues from becoming 3am pages. Best for teams who want a predictable start-of-shift ritual: review the queue, approve the safe fixes, defer the rest — instead of discovering the same warnings scattered across dashboards.
How it works
1. A nightly schedule triggers the sweep.
2. The agent pulls warning-level Datadog monitors and recently auto-resolved PagerDuty incidents.
3. It groups related signals and drafts one proposed remediation per cluster.
4. A logic step ranks items by risk and filters out anything already self-healed.
5. It posts a single batched approval queue to Slack with per-item Approve / Skip controls.
6. Approved items execute their shell action; the agent posts a closing summary of what ran.
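The batched approval queue with per-item Approve / Skip controls could be expressed as a Slack Block Kit message. The sketch below only builds the `blocks` payload; whether this template uses Block Kit internally is an assumption. The item fields (`id`, `title`, `command`) and the `approve_fix` / `skip_fix` action IDs are hypothetical names.

```python
def approval_blocks(items):
    """Build one Slack message body: a header plus two blocks per queue item."""
    blocks = [{
        "type": "header",
        "text": {"type": "plain_text", "text": "Morning fix-approval queue"},
    }]
    for item in items:
        # Section block: what was found and the proposed shell fix.
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*{item['title']}*\nProposed fix: `{item['command']}`",
            },
        })
        # Actions block: the button value carries the item id back to the handler.
        blocks.append({
            "type": "actions",
            "block_id": f"item-{item['id']}",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Approve"},
                    "style": "primary",
                    "action_id": "approve_fix",  # hypothetical action id
                    "value": item["id"],
                },
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Skip"},
                    "action_id": "skip_fix",  # hypothetical action id
                    "value": item["id"],
                },
            ],
        })
    return blocks
```

Because each item uses two blocks, a long queue may need to be split across several messages to stay within Slack's per-message block limit.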
Set it up
What you configure once, before turning it on.
1. Connect Datadog: Metrics, traces, log search.
2. Connect PagerDuty: Incidents, on-call, escalations.
3. Connect Slack: Channels, DMs, threads, mentions.
4. Connect Shell: Run sandboxed commands inside the workspace.
5. Set each agent's model: We leave models unset so you pick the tier: fast and cheap, or top-quality.
6. Tune it to your data: Edit the prompts, filters, and field mappings so it matches how your team works.
7. Test, then turn it on: Run once against a sample, confirm the output, then enable the trigger.
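For the test run, one cautious pattern is a dry-run mode that reports what each approved fix would execute without running it. The sketch below assumes each queue item is a dict with hypothetical `id`, `command`, and `approved` keys. It uses Python's standard `subprocess` module rather than the platform's own sandboxed Shell connection.

```python
import shlex
import subprocess


def run_approved(items, dry_run=True, timeout=60):
    """Return (item id, outcome) pairs; only approved items are ever executed."""
    results = []
    for item in items:
        if not item.get("approved"):
            results.append((item["id"], "skipped"))
            continue
        # Split into an argument list so no shell interprets the string.
        argv = shlex.split(item["command"])
        if dry_run:
            results.append((item["id"], "would run: " + shlex.join(argv)))
            continue
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        outcome = "ok" if proc.returncode == 0 else f"failed ({proc.returncode})"
        results.append((item["id"], outcome))
    return results
```

Defaulting `dry_run` to `True` means a sample run only reports intent. Passing `dry_run=False` is the explicit step that matches enabling the trigger.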
More AI Agents workflows
Custom Metrics Cardinality Spike Pager
A webhook from a Datadog monitor fires when custom-metric cardinality jumps; an agent pinpoints the offending metric and tag and estimates the added cost.
Sentry-to-Confluence Runbook Updater
When a Sentry issue is resolved, the agent finds the matching Confluence runbook page and proposes an inline update with the verified fix.
Stale Doc-PR Chaser for Runbook Gaps
On a daily schedule, the agent finds runbook doc PRs that were opened from resolved incidents but never reviewed and summarizes what each one fixes.
Resolved Incident to Public Troubleshooting Doc
For customer-facing errors resolved in Sentry, the agent drafts a sanitized troubleshooting entry and opens a PR to your ReadMe documentation.
On-Call Runbook Gap Closer: Resolved Sentry Issues to Doc PRs
An agent reads each newly resolved Sentry issue, compares the actual fix against your existing runbook, and opens a GitHub PR adding the missing remediation steps.
Weekly On-Call Doc-Gap Digest
Each week, the agent reviews every Sentry issue resolved in the last 7 days and ranks the ones whose runbook coverage is missing or thin.
Run it inside a business
This workflow drops into a full company template. Import the org, and this is one of the playbooks its agents run.

