
Flaky Terraform Apply Auto-Retry with PagerDuty Escalation

Watches for failed Terraform apply runs, automatically retries transient infra failures with backoff, and escalates to PagerDuty only when retries are exhausted on a real error.

Category: DevOps
Engine: sim
Difficulty: advanced
Trigger: webhook
Steps: 5
Setup: ~25 min

How it runs

The automated pipeline, trigger to output.

  • [Trigger] Webhook: terraform apply failed (HTTP webhook)
  • [Logic] Classify error: transient vs real
  • [Action] Retry apply with backoff (Shell)
  • [Logic] Check retry result and budget
  • [Output] Open PagerDuty incident on real failure (PagerDuty)

What it does

When a Terraform apply fails, this workflow inspects the error, distinguishes transient flakiness (rate limits, eventual-consistency races, lock contention) from genuine config errors, retries the transient ones with exponential backoff, and pages on-call via PagerDuty only when the failure is real or retries run out.
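
The transient-versus-real distinction described above can be sketched as a plain pattern match over the error log. This is a minimal illustration in Python, not the workflow's actual classifier: the function name `classify_error` and the pattern list are assumptions, and provider error messages vary, so treat the regexes as a starting point to tune.

```python
import re

# Hypothetical patterns for transient failures; adjust to the messages
# your providers and backend actually emit.
TRANSIENT_PATTERNS = [
    r"\b429\b",                         # HTTP rate-limit status code
    r"rate limit|throttl",              # rate limiting / throttling wording
    r"error acquiring the state lock",  # Terraform state lock contention
    r"lock timeout",
    r"not (yet )?ready",                # dependency or resource not ready yet
    r"eventual consistency",
]


def classify_error(log: str) -> str:
    """Return 'transient' if any pattern matches the log, else 'real'."""
    for pattern in TRANSIENT_PATTERNS:
        if re.search(pattern, log, re.IGNORECASE):
            return "transient"
    return "real"
```

Defaulting to "real" when nothing matches keeps the classifier conservative: an unrecognized error pages a human rather than being retried silently.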

When to use it

Use it when your apply pipeline fails intermittently on provider rate limits or resource-not-yet-ready errors, and you want self-healing retries instead of a human re-running the job at 3am — while still guaranteeing a page for failures that actually need eyes.

How it works

  1. A webhook trigger receives the apply-failed event from your CI pipeline with the error log attached.
  2. A logic step classifies the error against a transient-pattern list (429s, lock timeouts, dependency-not-ready).
  3. If the error is transient and retry budget remains, a shell action re-runs `terraform apply` after a backoff delay.
  4. A logic step checks the retry result: success ends the run cleanly, while a failure is re-checked against the error class and the remaining retry budget.
  5. On a non-transient error or an exhausted retry budget, an output step opens a PagerDuty incident with the failing resource, error class, and run link.
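
The retry-and-escalate flow in the steps above can be sketched as a small loop. This is an illustrative Python sketch under stated assumptions, not the workflow's own implementation: the names `retry_apply`, `run_apply`, and `backoff_delay`, the budget of 3 retries, and the 30-second base delay are all hypothetical. The `terraform apply` flags used (`-auto-approve`, `-input=false`) are standard CLI options for non-interactive runs.

```python
import subprocess
import time
from typing import Callable, Dict, Tuple


def backoff_delay(attempt: int, base: float = 30.0, cap: float = 600.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def run_apply() -> Tuple[bool, str]:
    """Re-run terraform non-interactively; returns (succeeded, combined output)."""
    proc = subprocess.run(
        ["terraform", "apply", "-auto-approve", "-input=false"],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def retry_apply(
    error_log: str,
    is_transient: Callable[[str], bool],
    apply: Callable[[], Tuple[bool, str]] = run_apply,
    max_retries: int = 3,  # assumed retry budget
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, object]:
    """Retry while the latest error is transient and budget remains."""
    log = error_log
    attempts = 0
    while attempts < max_retries and is_transient(log):
        sleep(backoff_delay(attempts))
        ok, log = apply()
        attempts += 1
        if ok:
            return {"status": "recovered", "attempts": attempts}
    # Either the latest error is not transient, or the budget ran out.
    reason = "retries_exhausted" if is_transient(log) else "real_error"
    return {"status": "escalate", "reason": reason, "attempts": attempts, "log": log}
```

Passing `apply` and `sleep` as parameters lets the loop be exercised against canned outcomes without touching real infrastructure or waiting out the delays. Each failed retry is re-classified, so a transient error that turns into a real one escalates immediately instead of consuming the remaining budget.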

Set it up

What you configure once, before turning it on.

  1. Connect HTTP webhook: Trigger any URL on agent actions.
  2. Connect Shell: Run sandboxed commands inside the workspace.
  3. Connect PagerDuty: Incidents, on-call, escalations.
  4. Set each agent's model: We leave models unset so you pick the tier, fast and cheap or top-quality.
  5. Tune it to your data: Edit the prompts, filters, and field mappings so it matches how your team works.
  6. Test, then turn it on: Run once against a sample, confirm the output, then enable the trigger.
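
When confirming the output of a test run, it can help to know roughly what an escalation event looks like. The sketch below builds a trigger event in the shape of the PagerDuty Events API v2, carrying the failing resource, error class, and run link. Whether the PagerDuty connector uses this API internally is an assumption, and the function name, `source` value, dedup key format, and sample values are hypothetical placeholders.

```python
import json
from typing import Dict


def build_pagerduty_event(
    routing_key: str, resource: str, error_class: str, run_url: str
) -> Dict[str, object]:
    """Build an Events API v2 style trigger event (nothing is sent)."""
    summary = f"terraform apply failed on {resource} ({error_class})"
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Assumed dedup scheme: one open incident per failing resource.
        "dedup_key": f"terraform-apply:{resource}",
        "payload": {
            "summary": summary[:1024],  # Events API v2 caps summary length
            "source": "terraform-apply-auto-retry",  # hypothetical source name
            "severity": "error",
            "component": resource,
            "class": error_class,
        },
        "links": [{"href": run_url, "text": "Failed run"}],
    }


if __name__ == "__main__":
    # Placeholder values for a dry run; replace with your own.
    event = build_pagerduty_event(
        "YOUR_ROUTING_KEY",
        "aws_instance.web",
        "retries_exhausted",
        "https://ci.example.com/runs/123",
    )
    print(json.dumps(event, indent=2))
```

Keying deduplication on the resource means repeated failures of the same resource update one incident rather than opening a new page each time.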
