ENGINEERING

Shadow-traffic A/B bench two Replicate versions before cutover

On a schedule, replays a sampled slice of recent production inputs through both the live and candidate Replicate versions and compares quality, latency, and cost side by side.

Category: Engineering
Engine: sim
Difficulty: advanced
Trigger: schedule
Steps: 5
Setup: ~25 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: Scheduled run pulls sampled production inputs (Google BigQuery)
  • Action: Replay slice through live + candidate versions (Replicate)
  • Logic: Compare quality, latency, cost head to head
  • Action: Write A/B comparison report (Google BigQuery)
  • Output: Post cutover recommendation to Slack

What it does

Runs a shadow A/B comparison so you can judge a candidate Replicate version on real recent traffic, not just a synthetic bench. It replays a sampled slice of production inputs through both the live and candidate versions, scores them head to head on quality, latency, and cost, and produces a cutover recommendation.
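
One way to pick the "sampled slice" reproducibly is to hash each request ID into a bucket and keep the rows that fall under the sampling rate. The sketch below is only an illustration: the `request_id` field, the salt, and the function names are assumptions, not part of this workflow's actual configuration, and in practice the sampling may happen inside the BigQuery query itself.

```python
import hashlib


def in_sample(request_id: str, rate: float, salt: str = "shadow-ab") -> bool:
    """Return True if this request falls in the sampled slice.

    Hashing makes the choice deterministic: the same ID, salt, and rate
    always give the same answer, so reruns see the same slice.
    """
    digest = hashlib.sha256(f"{salt}:{request_id}".encode()).digest()
    # Interpret the first 8 bytes as an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


def sample_slice(rows, rate, id_field="request_id"):
    """Keep only the rows whose ID hashes into the sampled fraction."""
    return [row for row in rows if in_sample(str(row[id_field]), rate)]
```

Changing the hypothetical `salt` yields a different but equally sized slice, which can help avoid benchmarking every candidate against the same inputs.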

When to use it

Use it before swapping production traffic to a new version when you need confidence on representative inputs and want to weigh quality against inference cost. Ideal for cost-sensitive endpoints where a marginally better model may not justify a price increase.
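
The quality-versus-cost trade-off can be made explicit as a small decision rule. The following is a minimal sketch under stated assumptions: the thresholds (`min_quality_gain`, `max_cost_increase`), the `VersionStats` fields, and the rule itself are hypothetical and should be tuned to your endpoint; the workflow's own logic step may weigh things differently.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    mean_quality: float   # higher is better, e.g. a score between 0 and 1
    mean_cost_usd: float  # average cost per call, must be > 0 for the live version


def recommend(live: VersionStats, candidate: VersionStats,
              min_quality_gain: float = 0.02,
              max_cost_increase: float = 0.10) -> str:
    """Return "cutover" or "hold" for the candidate version."""
    quality_gain = candidate.mean_quality - live.mean_quality
    # Relative cost change, e.g. 0.30 means the candidate costs 30% more.
    cost_change = (candidate.mean_cost_usd - live.mean_cost_usd) / live.mean_cost_usd

    if quality_gain < 0:
        return "hold"      # never recommend a quality regression
    if cost_change <= 0:
        return "cutover"   # no worse on quality and no more expensive
    if quality_gain >= min_quality_gain and cost_change <= max_cost_increase:
        return "cutover"   # the gain is large enough to justify the extra cost
    return "hold"          # marginally better but too expensive
```

With these defaults, a candidate that scores 0.81 against a live 0.80 but costs 30% more per call comes back as "hold", which is the cost-sensitive case described above.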

How it works

  1. A schedule trigger pulls a sampled slice of recent production inputs from BigQuery.
  2. The flow replays each input through the live Replicate version and the candidate version.
  3. It scores both outputs head to head and tallies latency and per-call cost.
  4. A logic step decides recommend-cutover or hold based on the quality-versus-cost trade-off.
  5. It writes the full A/B comparison report to BigQuery for the record.
  6. It posts the side-by-side summary and recommendation to Slack.
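
The replay and scoring steps above can be sketched as a small harness. This is a rough illustration, not the workflow's implementation: `run_version` and `score` are hypothetical callables you would supply (for example, a thin wrapper around the Replicate client and your own quality metric), and estimating per-call cost as latency times a per-second price is an assumption that may not match how your model is billed.

```python
import time
from statistics import mean
from typing import Any, Callable, Dict, Iterable, List

Runner = Callable[[str, Dict[str, Any]], Any]   # (version_id, input) -> output
Scorer = Callable[[Dict[str, Any], Any], float]  # (input, output) -> quality score


def replay(inputs: Iterable[Dict[str, Any]], versions: Dict[str, str],
           run_version: Runner, score: Scorer,
           price_per_second: Dict[str, float]) -> List[Dict[str, Any]]:
    """Run every input through every version and record one row per call."""
    rows: List[Dict[str, Any]] = []
    for item in inputs:
        for label, version_id in versions.items():
            started = time.perf_counter()
            output = run_version(version_id, item)
            latency_s = time.perf_counter() - started
            rows.append({
                "label": label,
                "quality": score(item, output),
                "latency_s": latency_s,
                # Assumed cost model: wall-clock seconds times a per-second price.
                "cost_usd": latency_s * price_per_second[label],
            })
    return rows


def summarize(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-call rows into per-version means for the report."""
    summary: Dict[str, Dict[str, float]] = {}
    for label in sorted({row["label"] for row in rows}):
        subset = [row for row in rows if row["label"] == label]
        summary[label] = {
            "n": len(subset),
            "mean_quality": mean(row["quality"] for row in subset),
            "mean_latency_s": mean(row["latency_s"] for row in subset),
            "mean_cost_usd": mean(row["cost_usd"] for row in subset),
        }
    return summary
```

Passing the runner and scorer in as arguments keeps the harness testable with stubs before it is pointed at real traffic; the per-call rows map naturally onto a BigQuery report table, and the summary onto the Slack message.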

Set it up

What you configure once, before turning it on.

  1. Connect BigQuery: datasets, queries, schemas.
  2. Connect Replicate: image, video, and model inference.
  3. Connect Slack: channels, DMs, threads, mentions.
  4. Set each agent's model: we leave models unset so you pick the tier, fast and cheap or top-quality.
  5. Tune it to your data: edit the prompts, filters, and field mappings so it matches how your team works.
  6. Test, then turn it on: run once against a sample, confirm the output, then enable the trigger.
