DEVOPS
Cold-Start Triage Agent for Replicate Endpoints
On a PagerDuty cold-start incident, an agent gathers Replicate timings and Datadog metrics, decides whether to warm the pool or hand off to a human, and applies the fix when warming is the right call.
How it runs
The automated pipeline, trigger to output.
- Trigger: PagerDuty cold-start incident opened (PagerDuty)
- Action: Gather Replicate timings + Datadog latency trend (Replicate)
- Logic: Agent decides: cold-start surge vs model fault
- Action: Warm pool and confirm recovery, if surge (Replicate)
- Output: Write triage note back to PagerDuty incident (PagerDuty)
What it does
This workflow runs an agent-driven triage on Replicate cold-start incidents. Rather than a fixed pipeline, the agent reasons over live signals — Replicate prediction timings and Datadog latency history — to judge root cause, then either auto-warms the pool or escalates with a recommendation. It documents its reasoning directly on the PagerDuty incident.
When to use it
Use it when cold-start incidents need judgment, not just a reflex: distinguishing a genuine traffic-driven cold start (warm and resolve) from an upstream model error or version rollout (don't warm, escalate). Best for teams that want a first-responder that thinks before acting.
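The distinction above can be illustrated with a small heuristic. This is only a sketch: the workflow's agent reasons over the signals with a model rather than a fixed rule, and everything named here (`PredictionSample`, `classify`, the `version_changed` flag, and both thresholds) is a hypothetical stand-in, not part of the workflow. It assumes the start delay for each prediction has already been computed, for example from prediction timestamps.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PredictionSample:
    """One recent prediction, reduced to the two signals used below (hypothetical shape)."""
    status: str           # e.g. "succeeded" or "failed"
    start_delay_s: float  # seconds between creation and start, assumed pre-computed


def classify(samples, version_changed, slow_start_s=30.0, max_failure_rate=0.2):
    """Return "warm" for a clean cold-start surge, otherwise "escalate".

    Thresholds are illustrative assumptions; tune them to your own traffic.
    """
    if not samples:
        # No evidence either way: leave the decision to a human.
        return "escalate"

    failure_rate = sum(s.status == "failed" for s in samples) / len(samples)
    typical_delay = median(s.start_delay_s for s in samples)

    # A recent version rollout or many failures points at the model, not traffic.
    if version_changed or failure_rate > max_failure_rate:
        return "escalate"

    # Slow starts with healthy outcomes look like a traffic-driven cold start.
    if typical_delay >= slow_start_s:
        return "warm"

    # Ambiguous signals: do not warm, hand off.
    return "escalate"
```

The sketch defaults to escalation whenever the evidence is missing or mixed, which matches the "thinks before acting" intent: warming is only chosen when slow starts are the sole anomaly.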
How it works
A PagerDuty incident trigger starts the agent. It pulls recent Replicate predictions to inspect boot times and failure modes, and queries Datadog for the latency trend leading into the incident. The agent decides: if it's a clean cold-start surge, it submits warm-up predictions to Replicate and confirms recovery; if signals point to a model fault or bad deploy, it skips warming. Finally it posts a triage note to the PagerDuty incident — what it found, what it did, and the recommended next step for on-call.
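The final step, the triage note, might be assembled along these lines. This is a minimal sketch under assumptions: `build_triage_note` is a hypothetical helper, and the `{"note": {"content": ...}}` body is the shape PagerDuty's REST API is understood to accept when adding a note to an incident; confirm it against the current API reference before relying on it. The sketch only builds the payload and does not send anything.

```python
import json


def build_triage_note(findings: str, action: str, next_step: str) -> dict:
    """Assemble a three-part triage note: what was found, what was done, what to do next.

    The outer {"note": {"content": ...}} wrapper is an assumption about the
    PagerDuty notes endpoint; the field labels are illustrative.
    """
    content = "\n".join([
        f"Findings: {findings}",
        f"Action taken: {action}",
        f"Recommended next step: {next_step}",
    ])
    return {"note": {"content": content}}


if __name__ == "__main__":
    payload = build_triage_note(
        findings="Start delays rose with traffic; no failures or version change seen.",
        action="Submitted warm-up predictions and confirmed latency recovered.",
        next_step="No action needed; monitor for recurrence.",
    )
    print(json.dumps(payload, indent=2))
```

Keeping the three parts as separate labeled lines lets on-call read the outcome at a glance, whether the agent warmed the pool or skipped warming and escalated.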
Set it up
What you configure once, before turning it on.
- 1. Connect PagerDuty: Incidents, on-call, escalations.
- 2. Connect Replicate: Image, video, and model inference.
- 3. Connect Datadog: Metrics, traces, log search.
- 4. Set each agent's model: We leave models unset so you pick the tier (fast and cheap, or top-quality).
- 5. Tune it to your data: Edit the prompts, filters, and field mappings so it matches how your team works.
- 6. Test, then turn it on: Run once against a sample, confirm the output, then enable the trigger.
