Cold-Start Triage Agent for Replicate Endpoints

On a PagerDuty cold-start incident, an agent gathers Replicate timings and Datadog metrics, decides whether to warm the pool or hand off to a human, and then acts on that decision.

Category: DevOps
Engine: paperclip
Difficulty: advanced
Trigger: event
Steps: 5
Setup: ~25 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: PagerDuty cold-start incident opened (PagerDuty)
  • Action: Gather Replicate timings + Datadog latency trend (Replicate)
  • Logic: Agent decides: cold-start surge vs model fault
  • Action: Warm pool and confirm recovery (if surge) (Replicate)
  • Output: Write triage note back to PagerDuty incident (PagerDuty)

What it does

This workflow runs an agent-driven triage on Replicate cold-start incidents. Rather than following a fixed pipeline, the agent reasons over live signals (Replicate prediction timings and Datadog latency history) to judge the root cause, then either auto-warms the pool or escalates with a recommendation. It documents its reasoning directly on the PagerDuty incident.

When to use it

Use it when cold-start incidents need judgment, not just a reflex: distinguishing a genuine traffic-driven cold start (warm and resolve) from an upstream model error or version rollout (don't warm, escalate). Best for teams that want a first-responder that thinks before acting.
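The distinction above can be pictured as a small rule of thumb. This is only a sketch: in the workflow the agent reasons over the signals rather than applying fixed thresholds, and the `TriageSignals` fields, the `classify` function, and the threshold values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TriageSignals:
    """Hypothetical summary of gathered signals; field names are illustrative."""
    slow_boot_ratio: float   # share of recent predictions with a long boot phase
    failure_ratio: float     # share of recent predictions that failed outright
    version_changed: bool    # model version changed shortly before the incident
    traffic_rising: bool     # request volume was climbing into the incident


def classify(signals: TriageSignals,
             max_failure_ratio: float = 0.1,
             min_slow_boot_ratio: float = 0.5) -> str:
    """Return 'warm' for a traffic-driven cold start, otherwise 'escalate'."""
    # Failures or a fresh rollout suggest a fault that warming will not fix.
    if signals.version_changed or signals.failure_ratio > max_failure_ratio:
        return "escalate"
    # Slow boots with rising traffic and few failures look like a clean surge.
    if signals.traffic_rising and signals.slow_boot_ratio >= min_slow_boot_ratio:
        return "warm"
    # Ambiguous signals: hand off to a human rather than act.
    return "escalate"


if __name__ == "__main__":
    surge = TriageSignals(slow_boot_ratio=0.8, failure_ratio=0.0,
                          version_changed=False, traffic_rising=True)
    rollout = TriageSignals(slow_boot_ratio=0.8, failure_ratio=0.0,
                            version_changed=True, traffic_rising=True)
    print(classify(surge), classify(rollout))
```

Defaulting to escalation whenever the signals are ambiguous reflects the "thinks before acting" intent: warming is reserved for the case where the evidence for a plain surge is clear.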

How it works

A PagerDuty incident trigger starts the agent. It pulls recent Replicate predictions to inspect boot times and failure modes, and queries Datadog for the latency trend leading into the incident. The agent then decides: if it is a clean cold-start surge, it submits warm-up predictions to Replicate and confirms recovery; if the signals point to a model fault or a bad deploy, it skips warming and leaves the fix to on-call. Finally, it posts a triage note to the PagerDuty incident covering what it found, what it did, and the recommended next step for on-call.
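The control flow described above might be sketched as follows. This is a minimal outline under stated assumptions, not the workflow's actual implementation: `run_triage` and its injected callables (`gather`, `decide`, `warm_pool`, `post_note`) are hypothetical stand-ins for the PagerDuty, Replicate, and Datadog integrations, and no real API calls are shown.

```python
from typing import Callable, Dict


def run_triage(incident_id: str,
               gather: Callable[[str], Dict],
               decide: Callable[[Dict], str],
               warm_pool: Callable[[], bool],
               post_note: Callable[[str, str], None]) -> str:
    """Gather signals, decide, optionally warm, then post a triage note."""
    # Step 2: collect Replicate timings and the Datadog latency trend.
    signals = gather(incident_id)
    # Step 3: the agent's judgment, reduced here to a 'surge' or 'fault' verdict.
    verdict = decide(signals)

    if verdict == "surge":
        # Step 4: warm the pool and check whether latency recovered.
        recovered = warm_pool()
        if recovered:
            action = "warmed pool; recovery confirmed"
            next_step = "monitor latency"
        else:
            action = "warmed pool; recovery not confirmed"
            next_step = "investigate capacity manually"
    else:
        # Model fault or bad deploy: warming would not help, so skip it.
        action = "skipped warming"
        next_step = "check the model version and recent deploys"

    # Step 5: write what was found, what was done, and the next step.
    note = f"Found: {verdict}\nDid: {action}\nNext: {next_step}"
    post_note(incident_id, note)
    return note


if __name__ == "__main__":
    # Stubbed run with a placeholder incident ID and canned signals.
    print(run_triage(
        "INC-EXAMPLE",
        gather=lambda _id: {"slow_boots": 12, "failures": 0},
        decide=lambda s: "surge" if s["failures"] == 0 else "fault",
        warm_pool=lambda: True,
        post_note=lambda _id, _note: None,
    ))
```

Passing the integrations in as callables keeps the branching testable with stubs before any real connection is enabled.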

Set it up

What you configure once, before turning it on.

  1. Connect PagerDuty: incidents, on-call, escalations.
  2. Connect Replicate: image, video, and model inference.
  3. Connect Datadog: metrics, traces, log search.
  4. Set each agent's model: we leave models unset so you pick the tier, either fast and cheap or top quality.
  5. Tune it to your data: edit the prompts, filters, and field mappings so it matches how your team works.
  6. Test, then turn it on: run once against a sample, confirm the output, then enable the trigger.
