
Bench a new Replicate version against a Hugging Face eval dataset

On manual launch, pulls a versioned evaluation dataset from Hugging Face, scores a candidate Replicate model version case by case, and writes the full results table to Postgres.

Category: Engineering
Engine: sim
Difficulty: intermediate
Trigger: manual
Steps: 6
Setup: ~15 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: Manual launch with version + dataset revision
  • Action: Download pinned eval dataset (Hugging Face)
  • Action: Score candidate version case-by-case (Replicate)
  • Action: Persist results to bench-history table (Postgres)
  • Logic: Evaluate aggregate metrics vs acceptance bar
  • Output: Post pass/fail scorecard to Slack
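
As a rough illustration of the "Persist results" action above, the sketch below builds one row per case and writes the rows in a single transaction. The table name `bench_history`, its columns, and the helpers `build_rows` and `persist` are hypothetical; the page only says "bench-history table" and does not specify a schema. The write path assumes psycopg 3 is installed.

```python
from typing import Iterable, List, Mapping, Tuple

# Hypothetical schema: the workflow description does not define one.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS bench_history (
    run_id           TEXT NOT NULL,
    model_version    TEXT NOT NULL,
    dataset_revision TEXT NOT NULL,
    case_id          TEXT NOT NULL,
    score            DOUBLE PRECISION NOT NULL,
    latency_s        DOUBLE PRECISION NOT NULL,
    created_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (run_id, case_id)
)
"""

INSERT_SQL = """
INSERT INTO bench_history
    (run_id, model_version, dataset_revision, case_id, score, latency_s)
VALUES (%s, %s, %s, %s, %s, %s)
"""

Row = Tuple[str, str, str, str, float, float]


def build_rows(
    run_id: str,
    model_version: str,
    dataset_revision: str,
    results: Iterable[Mapping],
) -> List[Row]:
    """Flatten per-case results into parameter tuples for INSERT_SQL."""
    return [
        (
            run_id,
            model_version,
            dataset_revision,
            str(r["case_id"]),
            float(r["score"]),
            float(r["latency_s"]),
        )
        for r in results
    ]


def persist(postgres_url: str, rows: List[Row]) -> None:
    """Write all rows for one run; assumes psycopg 3."""
    import psycopg  # imported lazily so build_rows works without it

    # Leaving the connection block without an exception commits the transaction.
    with psycopg.connect(postgres_url) as conn:
        with conn.cursor() as cur:
            cur.execute(CREATE_SQL)
            cur.executemany(INSERT_SQL, rows)
```

Storing the model version and dataset revision on every row is one way to keep runs comparable over time: any two runs can be joined on `case_id` and filtered to the same revision.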

What it does

Produces a reproducible scorecard for a candidate Replicate version using a versioned Hugging Face evaluation dataset as the source of truth. It scores every case, persists the full results so runs are comparable over time, and returns a clear pass/fail summary against your acceptance bar.
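
To make "pass/fail against your acceptance bar" concrete, here is a minimal sketch of one possible check. The two metrics (mean score and 95th-percentile latency), the `AcceptanceBar` fields, and the function names are assumptions for illustration; the workflow does not state which aggregate metrics it uses.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass(frozen=True)
class AcceptanceBar:
    """Hypothetical thresholds; pick the ones your team actually gates on."""
    min_mean_score: float
    max_p95_latency_s: float


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]


def evaluate(
    scores: Sequence[float],
    latencies_s: Sequence[float],
    bar: AcceptanceBar,
) -> Dict[str, object]:
    """Aggregate per-case results and compare them with the bar."""
    mean_score = mean(scores)
    p95_latency = p95(latencies_s)
    passed = (
        mean_score >= bar.min_mean_score
        and p95_latency <= bar.max_p95_latency_s
    )
    return {
        "mean_score": mean_score,
        "p95_latency_s": p95_latency,
        "passed": passed,
    }
```

Keeping the bar in a small immutable object makes it easy to record alongside the run, so a later reader can see which thresholds produced a given verdict.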

When to use it

Use it when you want an on-demand bench you can trigger before any promotion decision, with the eval set governed in Hugging Face so reviewers can audit exactly what the model was tested on. Good for pre-release sign-off and for comparing two candidate versions.
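
For the two-candidate comparison mentioned above, a case-level diff is often more informative than comparing two averages. The sketch below assumes both runs used the same dataset revision and that each run is available as a mapping from case ID to score; `compare_runs` and its return shape are hypothetical.

```python
from typing import Dict, List, Mapping


def compare_runs(
    baseline: Mapping[str, float],
    candidate: Mapping[str, float],
) -> Dict[str, object]:
    """Compare two runs on the cases they share."""
    shared: List[str] = sorted(baseline.keys() & candidate.keys())
    improved = [c for c in shared if candidate[c] > baseline[c]]
    regressed = [c for c in shared if candidate[c] < baseline[c]]
    # Mean per-case score change; 0.0 when the runs share no cases.
    mean_delta = (
        sum(candidate[c] - baseline[c] for c in shared) / len(shared)
        if shared
        else 0.0
    )
    return {
        "shared_cases": len(shared),
        "improved": improved,
        "regressed": regressed,
        "mean_delta": mean_delta,
    }
```

Restricting the comparison to shared case IDs avoids crediting a version for cases the other run never saw.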

How it works

  1. An operator launches the run manually, passing the candidate Replicate version and the Hugging Face dataset revision.
  2. The flow downloads the pinned dataset revision from Hugging Face.
  3. It runs each case through the candidate Replicate version and collects scores and latency.
  4. It writes the per-case results and run metadata to a Postgres bench-history table.
  5. A logic step evaluates aggregate metrics against the acceptance threshold to set pass or fail.
  6. It posts the scorecard with the dataset revision and run link to Slack.
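
Steps 2 and 3 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the workflow's actual implementation: the dataset field names (`id`, `input`, `expected`), the exact-match scorer, and the `prompt` input key are all hypothetical, and the Replicate helper assumes a text model. The Hugging Face `datasets` and `replicate` packages are imported lazily, so the scoring loop can be exercised with any callable.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Mapping


@dataclass
class CaseResult:
    case_id: str
    score: float
    latency_s: float


def exact_match(expected: str, actual: str) -> float:
    """Hypothetical scorer: 1.0 on an exact match after trimming, else 0.0."""
    return 1.0 if expected.strip() == actual.strip() else 0.0


def load_pinned_cases(dataset_name: str, revision: str, split: str = "test"):
    """Download one split at a fixed revision (branch, tag, or commit hash)."""
    from datasets import load_dataset  # assumes the `datasets` package

    return load_dataset(dataset_name, revision=revision, split=split)


def make_replicate_predict(model_version: str) -> Callable[[str], str]:
    """Wrap a Replicate version such as "owner/name:versionhash" as a callable."""
    import replicate  # assumes REPLICATE_API_TOKEN is set in the environment

    def predict(prompt: str) -> str:
        # Assumption: the model takes a "prompt" input and returns text,
        # either as one string or as an iterable of chunks.
        output = replicate.run(model_version, input={"prompt": prompt})
        if isinstance(output, str):
            return output
        return "".join(str(chunk) for chunk in output)

    return predict


def score_cases(
    cases: Iterable[Mapping],
    predict: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match,
) -> List[CaseResult]:
    """Run every case through `predict`, recording score and wall-clock latency."""
    results = []
    for case in cases:
        start = time.perf_counter()
        output = predict(case["input"])
        latency = time.perf_counter() - start
        results.append(
            CaseResult(str(case["id"]), scorer(case["expected"], output), latency)
        )
    return results
```

A run would then look like `score_cases(load_pinned_cases(name, revision), make_replicate_predict(version))`. Pinning `revision` to a commit hash rather than a branch name keeps the evaluated cases fixed even if the dataset is updated later.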

Set it up

What you configure once, before turning it on.

  1. Connect Hugging Face. Models, datasets, spaces: the open-source hub.
  2. Connect Replicate. Image, video, and model inference.
  3. Connect Postgres. Any Postgres URL: query, write, migrate.
  4. Connect Slack. Channels, DMs, threads, mentions.
  5. Set each agent's model. We leave models unset so you pick the tier: fast and cheap, or top-quality.
  6. Tune it to your data. Edit the prompts, filters, and field mappings so it matches how your team works.
  7. Test, then turn it on. Run once against a sample, confirm the output, then enable the trigger.
