Replicate Cold-Start SLO Breach to PagerDuty

Triggers on a Datadog cold-start latency monitor alert and confirms the breach against live Replicate metrics before opening a PagerDuty incident.

Category: DevOps
Engine: sim
Difficulty: intermediate
Trigger: event
Steps: 4
Setup: ~15 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: Datadog cold-start latency monitor alert (Datadog)
  • Action: Confirm breach against live Replicate predictions (Replicate)
  • Logic: Sustained breach across multiple predictions?
  • Output: Open PagerDuty incident with context (PagerDuty)

What it does

This workflow turns a noisy latency signal into a trustworthy page. When Datadog detects that Replicate cold-start latency has breached your SLO, the flow double-checks the breach against live Replicate prediction data before deciding whether to wake anyone up. Transient single-request spikes are dropped; sustained degradation escalates to PagerDuty.
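
The "drop transient spikes, escalate sustained degradation" rule can be sketched as a small check over recent cold-start samples. This is only an illustration under assumptions: the function name `is_sustained_breach`, the five-sample window, and the three-breach minimum are hypothetical and not taken from the workflow itself.

```python
from typing import Sequence


def is_sustained_breach(
    latencies_s: Sequence[float],
    slo_s: float,
    min_breaches: int = 3,  # assumed threshold, tune to your SLO
    window: int = 5,        # assumed number of recent predictions to inspect
) -> bool:
    """Return True when at least `min_breaches` of the last `window`
    cold-start samples exceed the SLO."""
    recent = list(latencies_s)[-window:]
    breaches = sum(1 for seconds in recent if seconds > slo_s)
    return breaches >= min_breaches


# One slow request among fast ones counts as a transient spike.
print(is_sustained_breach([1.2, 0.9, 14.0, 1.1, 1.0], slo_s=10.0))    # False
# Four of the last five samples over the SLO counts as sustained.
print(is_sustained_breach([12.5, 1.0, 14.0, 11.2, 13.8], slo_s=10.0))  # True
```

Counting breaches within a window, rather than requiring consecutive breaches, tolerates an occasional warm request in the middle of a cold-start storm. Either choice is a reasonable reading of "sustained".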

When to use it

Use it when a Replicate endpoint backs a production feature with a latency SLO and you want on-call paged for genuine cold-start storms — not every momentary blip a raw monitor would fire on.

How it works

A Datadog monitor webhook triggers the flow on a latency-threshold alert. An action queries Replicate for the most recent predictions to confirm cold starts are actually elevated right now. A logic step requires the breach to be sustained across multiple recent predictions, not a one-off. If confirmed, the flow opens a PagerDuty incident with the measured latency, model version, and a link back to the Datadog graph. If it was a false alarm, it resolves quietly without paging.
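
The confirm-and-page steps above might look roughly like the sketch below. It makes two assumptions: that each Replicate prediction record carries ISO 8601 `created_at` and `started_at` timestamps, so the gap between them approximates cold-start delay, and that the incident is opened through a PagerDuty Events API v2 style body. The helper names, the `source` value, and the severity are hypothetical. Nothing is sent over the network here.

```python
from datetime import datetime


def cold_start_seconds(prediction: dict) -> float:
    """Approximate cold-start delay as started_at minus created_at.
    Field names are assumed, not confirmed by the workflow."""
    def parse(ts: str) -> datetime:
        # Normalize a trailing "Z" so older Python versions can parse it.
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return (parse(prediction["started_at"]) - parse(prediction["created_at"])).total_seconds()


def build_pagerduty_event(routing_key: str, latency_s: float,
                          model_version: str, graph_url: str) -> dict:
    """Build an Events API v2 style 'trigger' body with the context the
    workflow describes: measured latency, model version, Datadog link."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"Replicate cold-start SLO breach: {latency_s:.1f}s",
            "source": "replicate",   # assumed source label
            "severity": "critical",  # assumed severity
            "custom_details": {
                "measured_latency_s": latency_s,
                "model_version": model_version,
            },
        },
        "links": [{"href": graph_url, "text": "Datadog graph"}],
    }


sample = {"created_at": "2024-01-01T00:00:00Z", "started_at": "2024-01-01T00:00:12Z"}
event = build_pagerduty_event("ROUTING_KEY", cold_start_seconds(sample),
                              "model-version-id", "https://example.com/graph")
print(event["payload"]["summary"])  # Replicate cold-start SLO breach: 12.0s
```

Keeping the payload builder separate from any HTTP call makes it easy to run once against a sample and inspect the output before enabling the trigger.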

Set it up

What you configure once, before turning it on.

  1. Connect Datadog. Metrics, traces, log search.
  2. Connect Replicate. Image, video, and model inference.
  3. Connect PagerDuty. Incidents, on-call, escalations.
  4. Set each agent's model. We leave models unset so you pick the tier: fast + cheap, or top-quality.
  5. Tune it to your data. Edit the prompts, filters, and field mappings so it matches how your team works.
  6. Test, then turn it on. Run once against a sample, confirm the output, then enable the trigger.
