Bench a new Replicate version against a Hugging Face eval dataset
On manual launch, pulls a versioned evaluation dataset from Hugging Face, scores a candidate Replicate model version case-by-case, and writes the full results table to Postgres.
How it runs
The automated pipeline, trigger to output.
- Trigger: Manual launch with version + dataset revision
- Action: Download pinned eval dataset (Hugging Face)
- Action: Score candidate version case-by-case (Replicate)
- Action: Persist results to bench-history table (Postgres)
- Logic: Evaluate aggregate metrics vs acceptance bar
- Output: Post pass/fail scorecard to Slack (Slack)
What it does
Produces a reproducible scorecard for a candidate Replicate version using a versioned Hugging Face evaluation dataset as the source of truth. It scores every case, persists the full results so runs are comparable over time, and returns a clear pass/fail summary against your acceptance bar.
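To make "persists the full results so runs are comparable over time" concrete, here is a minimal sketch of what a bench-history table and a cross-run comparison query might look like. The table name `bench_history`, its columns, and both helper functions are hypothetical, not the workflow's actual schema. The sketch uses Python's built-in `sqlite3` as a local stand-in so it runs without a server; against Postgres you would use a Postgres driver and its placeholder style instead of `?`.

```python
import sqlite3

# Hypothetical schema: one row per (run, case), tagged with the model version
# and dataset revision so any two runs can be compared later.
DDL = """
CREATE TABLE IF NOT EXISTS bench_history (
    run_id           TEXT NOT NULL,
    model_version    TEXT NOT NULL,
    dataset_revision TEXT NOT NULL,
    case_id          TEXT NOT NULL,
    score            REAL NOT NULL,
    latency_s        REAL NOT NULL,
    PRIMARY KEY (run_id, case_id)
)
"""


def persist_results(conn, run_id, model_version, dataset_revision, results):
    """Write every per-case result for one run; returns the number of rows written."""
    conn.execute(DDL)
    rows = [
        (run_id, model_version, dataset_revision,
         r["case_id"], r["score"], r["latency_s"])
        for r in results
    ]
    # sqlite3 uses "?" placeholders; this is an artifact of the stand-in driver.
    conn.executemany("INSERT INTO bench_history VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def mean_score_by_run(conn):
    """Aggregate each stored run to a mean score, ordered by run id."""
    query = (
        "SELECT run_id, AVG(score) FROM bench_history "
        "GROUP BY run_id ORDER BY run_id"
    )
    return conn.execute(query).fetchall()
```

Keeping per-case rows rather than only the aggregate means a later reviewer can recompute any metric, or diff two runs case by case, without re-running the model.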
When to use it
Use it when you want an on-demand bench you can trigger before any promotion decision, with the eval set governed in Hugging Face so reviewers can audit exactly what the model was tested on. Good for pre-release sign-off and for comparing two candidate versions.
How it works
1. An operator launches the run manually, passing the candidate Replicate version and the Hugging Face dataset revision.
2. The flow downloads the pinned dataset revision from Hugging Face.
3. It runs each case through the candidate Replicate version and collects scores and latency.
4. It writes the per-case results and run metadata to a Postgres bench-history table.
5. A logic step evaluates aggregate metrics against the acceptance threshold to set pass or fail.
6. It posts the scorecard with the dataset revision and run link to Slack.
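The scoring, acceptance, and scorecard steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the workflow's actual implementation: the exact-match metric, the case fields (`id`, `input`, `expected`), the two thresholds, and every function name are hypothetical. The model call is passed in as a `predict` callable so the sketch runs offline; in a real run it would wrap the Replicate client, and `cases` would come from the pinned Hugging Face dataset revision.

```python
from dataclasses import dataclass
from statistics import mean
from time import perf_counter
from typing import Any, Callable, Iterable


@dataclass
class CaseResult:
    case_id: str
    score: float      # assumed metric: 1.0 on exact match, else 0.0
    latency_s: float  # wall-clock time of the model call


def score_cases(cases: Iterable[dict], predict: Callable[[Any], str]) -> list[CaseResult]:
    """Run every case through the candidate and record score and latency."""
    results = []
    for case in cases:
        start = perf_counter()
        output = predict(case["input"])
        latency = perf_counter() - start
        score = 1.0 if output.strip() == case["expected"].strip() else 0.0
        results.append(CaseResult(case["id"], score, latency))
    return results


def evaluate(results: list[CaseResult], min_mean_score: float,
             max_mean_latency_s: float) -> dict:
    """Aggregate per-case results and compare them with the acceptance bar."""
    if not results:
        raise ValueError("no cases were scored")
    mean_score = mean(r.score for r in results)
    mean_latency = mean(r.latency_s for r in results)
    passed = mean_score >= min_mean_score and mean_latency <= max_mean_latency_s
    return {
        "cases": len(results),
        "mean_score": mean_score,
        "mean_latency_s": mean_latency,
        "passed": passed,
    }


def format_scorecard(summary: dict, model_version: str,
                     dataset_revision: str, run_url: str) -> str:
    """Build the plain-text scorecard that would be posted to Slack."""
    verdict = "PASS" if summary["passed"] else "FAIL"
    return (
        f"{verdict}: {model_version} on dataset revision {dataset_revision}\n"
        f"cases={summary['cases']} "
        f"mean_score={summary['mean_score']:.3f} "
        f"mean_latency_s={summary['mean_latency_s']:.3f}\n"
        f"run: {run_url}"
    )
```

Including the dataset revision in the scorecard text is what lets a reviewer trace a pass or fail verdict back to the exact evaluation set it was measured on.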
Set it up
What you configure once, before turning it on.
1. Connect Hugging Face: Models, datasets, spaces (the open-source hub).
2. Connect Replicate: Image, video, and model inference.
3. Connect Postgres: Any Postgres URL (query, write, migrate).
4. Connect Slack: Channels, DMs, threads, mentions.
5. Set each agent's model: We leave models unset so you pick the tier (fast + cheap, or top-quality).
6. Tune it to your data: Edit the prompts, filters, and field mappings so it matches how your team works.
7. Test, then turn it on: Run once against a sample, confirm the output, then enable the trigger.