DEVOPS
Flaky Terraform Apply Auto-Retry with PagerDuty Escalation
Watches for failed Terraform apply runs, automatically retries transient infra failures with backoff, and escalates to PagerDuty only when the error is real or the retries are exhausted.
How it runs
The automated pipeline, trigger to output.
- Trigger (HTTP webhook): terraform apply failed
- Logic: Classify error as transient or real
- Action (Shell): Retry apply with backoff
- Logic: Check retry result and budget
- Output (PagerDuty): Open PagerDuty incident on real failure
What it does
When a Terraform apply fails, this workflow inspects the error, distinguishes transient flakiness (rate limits, eventual-consistency races, lock contention) from genuine config errors, retries the transient ones with exponential backoff, and pages on-call via PagerDuty only when the failure is real or retries run out.
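The transient-versus-real split described above can be sketched as a pattern match over the error log. This is a minimal illustration, not the workflow's actual classifier: the pattern list and the `classify_error` name are assumptions, and the patterns should be tuned to what your providers really emit.

```python
import re

# Hypothetical pattern list (an assumption for illustration); extend it with
# the rate-limit, lock, and not-ready messages your own providers produce.
TRANSIENT_PATTERNS = [
    r"\b429\b",                          # HTTP 429 Too Many Requests
    r"rate limit",                       # generic rate-limit wording
    r"throttl",                          # "Throttling", "throttled", ...
    r"error acquiring the state lock",   # state lock contention
    r"lock timeout",
    r"not yet ready",                    # eventual-consistency races
    r"dependency.*not ready",
]

_TRANSIENT_RE = re.compile("|".join(TRANSIENT_PATTERNS), re.IGNORECASE)


def classify_error(log: str) -> str:
    """Return "transient" if a known flaky pattern appears in the log, else "real"."""
    return "transient" if _TRANSIENT_RE.search(log) else "real"
```

A log can contain both a transient message and a genuine config error. This sketch would label such a log transient and let the retry budget catch it, so a stricter variant might check for known config-error patterns first.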
When to use it
Use it when your apply pipeline fails intermittently on provider rate limits or resource-not-yet-ready errors, and you want self-healing retries instead of a human re-running the job at 3am, while still guaranteeing a page for failures that actually need eyes.
How it works
1. A webhook trigger receives the apply-failed event from your CI pipeline with the error log attached.
2. A logic step classifies the error against a transient-pattern list (429s, lock timeouts, dependency-not-ready).
3. If transient and retry budget remains, a shell action re-runs `terraform apply` after a backoff delay.
4. A logic step checks the retry result: success ends the run cleanly.
5. On a non-transient error or an exhausted retry budget, an output step opens a PagerDuty incident with the failing resource, error class, and run link.
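The steps above can be sketched as a retry loop with a fixed budget and exponential backoff, followed by an escalation payload. Everything named here is an assumption for illustration: `backoff_delays`, `run_apply`, `retry_apply`, `build_pagerduty_event`, the default delays, and the retry budget of 3 are hypothetical. The event body follows the general shape of the PagerDuty Events API v2; the PagerDuty connector in this workflow may expose a different interface.

```python
import subprocess
import time
from typing import Callable, List, Optional, Tuple


def backoff_delays(max_retries: int, base: float = 30.0,
                   factor: float = 2.0, cap: float = 600.0) -> List[float]:
    """Exponential backoff schedule in seconds, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(max_retries)]


def run_apply() -> Tuple[bool, str]:
    """Re-run terraform non-interactively; returns (succeeded, combined log)."""
    proc = subprocess.run(
        ["terraform", "apply", "-auto-approve", "-input=false"],
        capture_output=True, text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def retry_apply(first_log: str,
                is_transient: Callable[[str], bool],
                apply: Callable[[], Tuple[bool, str]] = run_apply,
                sleep: Callable[[float], None] = time.sleep,
                max_retries: int = 3) -> Optional[dict]:
    """Return None if a retry succeeds, else a dict describing why to escalate."""
    log = first_log
    for attempt, delay in enumerate(backoff_delays(max_retries), start=1):
        if not is_transient(log):
            # A real error: stop retrying and escalate immediately.
            return {"reason": "non_transient", "attempts": attempt - 1, "log": log}
        sleep(delay)
        ok, log = apply()
        if ok:
            return None
    # Budget spent; classify the last failure for the page.
    reason = "retries_exhausted" if is_transient(log) else "non_transient"
    return {"reason": reason, "attempts": max_retries, "log": log}


def build_pagerduty_event(routing_key: str, escalation: dict,
                          resource: str, run_url: str) -> dict:
    """Build a trigger event carrying the failing resource, error class, and run link."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"terraform apply failed ({escalation['reason']}): {resource}",
            "source": "terraform-apply-retry",  # hypothetical source label
            "severity": "error",
            "custom_details": {
                "error_class": escalation["reason"],
                "attempts": escalation["attempts"],
                "log_tail": escalation["log"][-2000:],
            },
        },
        "links": [{"href": run_url, "text": "Failed run"}],
    }
```

Passing `apply` and `sleep` as parameters keeps the loop testable without running Terraform or waiting out real delays. In a real pipeline, adding jitter to the delays may help when several runs hit the same rate limit at once.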
Set it up
What you configure once, before turning it on.
1. Connect HTTP webhook: Trigger any URL on agent actions.
2. Connect Shell: Run sandboxed commands inside the workspace.
3. Connect PagerDuty: Incidents, on-call, escalations.
4. Set each agent's model: We leave models unset so you pick the tier, either fast and cheap or top-quality.
5. Tune it to your data: Edit the prompts, filters, and field mappings so it matches how your team works.
6. Test, then turn it on: Run once against a sample, confirm the output, then enable the trigger.