
Bench a new Replicate version against a Hugging Face eval dataset

On manual launch, pulls a versioned evaluation dataset from Hugging Face, scores a candidate Replicate model version case by case, and writes the full results table to Postgres.

Category: Engineering
Engine: sim
Difficulty: intermediate
Trigger: manual
Steps: 6
Setup: ~15 min

How it runs

The automated pipeline, trigger to output.

Trigger: Manual launch with version + dataset revision
Action: Download pinned eval dataset (Hugging Face)
Action: Score candidate version case-by-case (Replicate)
Action: Persist results to bench-history table (Postgres)
Logic: Evaluate aggregate metrics vs acceptance bar
Output: Post pass/fail scorecard to Slack (Slack)
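
The output step at the end of this pipeline could be sketched as follows. This is a minimal illustration, not the workflow's actual implementation: the function name `build_scorecard`, its parameters, and the message layout are all assumptions. The only external fact relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json


def build_scorecard(model_version: str, dataset_revision: str,
                    mean_score: float, threshold: float, run_url: str) -> dict:
    """Build a Slack message payload summarizing one bench run.

    All parameter names here are hypothetical; adapt them to your own run metadata.
    """
    verdict = "PASS" if mean_score >= threshold else "FAIL"
    text = (
        f"Bench {verdict}: {model_version}\n"
        f"Dataset revision: {dataset_revision}\n"
        f"Mean score: {mean_score:.3f} (acceptance bar: {threshold:.3f})\n"
        f"Run: {run_url}"
    )
    # Slack incoming webhooks accept a JSON object with a "text" field.
    return {"text": text}


def to_json(payload: dict) -> str:
    """Serialize the payload for an HTTP POST to the webhook URL."""
    return json.dumps(payload)
```

Keeping the dataset revision in the message body means a reviewer reading the channel can see what the model was tested on without opening the run.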

What it does

Produces a reproducible scorecard for a candidate Replicate version using a versioned Hugging Face evaluation dataset as the source of truth. It scores every case, persists the full results so runs are comparable over time, and returns a clear pass/fail summary against your acceptance bar.
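
To make "persists the full results so runs are comparable over time" concrete, here is one possible shape for the bench-history table. The table name `bench_history`, every column name, and the helper `to_rows` are assumptions made for illustration; the workflow's real schema may differ.

```python
from datetime import datetime, timezone

# Hypothetical schema: one row per (run, case), tagged with the model version
# and dataset revision so later runs can be compared like for like.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS bench_history (
    run_id           TEXT             NOT NULL,
    model_version    TEXT             NOT NULL,
    dataset_revision TEXT             NOT NULL,
    case_id          TEXT             NOT NULL,
    score            DOUBLE PRECISION NOT NULL,
    latency_s        DOUBLE PRECISION NOT NULL,
    created_at       TIMESTAMPTZ      NOT NULL,
    PRIMARY KEY (run_id, case_id)
)
"""

INSERT_SQL = """
INSERT INTO bench_history
    (run_id, model_version, dataset_revision, case_id, score, latency_s, created_at)
VALUES (%s, %s, %s, %s, %s, %s, %s)
"""


def to_rows(run_id, model_version, dataset_revision, results, created_at=None):
    """Flatten per-case results into parameter tuples matching INSERT_SQL."""
    created_at = created_at or datetime.now(timezone.utc)
    return [
        (run_id, model_version, dataset_revision,
         r["case_id"], float(r["score"]), float(r["latency_s"]), created_at)
        for r in results
    ]
```

With a DB-API driver such as psycopg, the rows could then be written with `cursor.executemany(INSERT_SQL, rows)`. Storing the dataset revision on every row is a design choice that keeps each result auditable on its own.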

When to use it

Use it when you want an on-demand bench you can trigger before any promotion decision, with the eval set governed in Hugging Face so reviewers can audit exactly what the model was tested on. Good for pre-release sign-off and for comparing two candidate versions.
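
For the two-candidate comparison, a sketch like the one below could diff per-case scores from two stored runs. It assumes both runs used the same dataset revision and that scores are keyed by a case ID; the function name `compare_runs` and the returned field names are hypothetical.

```python
def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict:
    """Compare two runs case by case, using only cases present in both."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("runs share no case IDs; were they scored on the same revision?")
    # Positive delta means the candidate scored higher on that case.
    deltas = {cid: candidate[cid] - baseline[cid] for cid in shared}
    return {
        "cases_compared": len(shared),
        "mean_delta": sum(deltas.values()) / len(shared),
        "regressions": [cid for cid in shared if deltas[cid] < 0],
        "improvements": [cid for cid in shared if deltas[cid] > 0],
    }
```

Listing regressions by case ID, rather than reporting only the mean, helps catch a candidate that improves on average while breaking specific cases.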

How it works

1. An operator launches the run manually, passing the candidate Replicate version and the Hugging Face dataset revision.
2. The flow downloads the pinned dataset revision from Hugging Face.
3. It runs each case through the candidate Replicate version and collects scores and latency.
4. It writes the per-case results and run metadata to a Postgres bench-history table.
5. A logic step evaluates aggregate metrics against the acceptance threshold to set pass or fail.
6. It posts the scorecard with the dataset revision and run link to Slack.
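
The scoring and evaluation steps above can be sketched roughly as below. This is a minimal illustration under several assumptions: each case is a dict with hypothetical `id`, `input`, and `expected` fields, the metric is exact match, and the acceptance bar is a minimum mean score. The `predict` callable is injected so the sketch runs offline; in a live run it would wrap a call to the candidate Replicate version, and the cases would come from the pinned Hugging Face dataset revision.

```python
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable


@dataclass
class CaseResult:
    case_id: str
    score: float      # 1.0 for an exact match, 0.0 otherwise (assumed metric)
    latency_s: float  # wall-clock time of the predict call


def score_cases(cases: Iterable[dict], predict: Callable[[str], str]) -> list[CaseResult]:
    """Run every case through `predict`, recording score and latency."""
    results = []
    for case in cases:
        start = time.perf_counter()
        output = predict(case["input"])
        latency = time.perf_counter() - start
        score = 1.0 if output.strip() == case["expected"].strip() else 0.0
        results.append(CaseResult(case["id"], score, latency))
    return results


def evaluate(results: list[CaseResult], min_mean_score: float) -> dict:
    """Aggregate per-case results and compare them with the acceptance threshold."""
    mean_score = mean(r.score for r in results)  # raises on an empty run
    return {
        "cases": len(results),
        "mean_score": mean_score,
        "mean_latency_s": mean(r.latency_s for r in results),
        "passed": mean_score >= min_mean_score,
    }
```

Because `predict` is a plain callable, the same loop can be dry-run against a stub before spending inference budget on the real model.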

Set it up

What you configure once, before turning it on.

1. Connect Hugging Face: Models, datasets, and spaces on the open-source hub.
2. Connect Replicate: Image, video, and model inference.
3. Connect Postgres: Any Postgres URL, to query, write, or migrate.
4. Connect Slack: Channels, DMs, threads, mentions.
5. Set each agent's model: We leave models unset so you pick the tier, either fast and cheap or top-quality.
6. Tune it to your data: Edit the prompts, filters, and field mappings so it matches how your team works.
7. Test, then turn it on: Run once against a sample, confirm the output, then enable the trigger.


