DEVOPS
Page on-call when a Hugging Face Space build is stuck or errored
Polls Hugging Face Space runtime status on a schedule and opens a PagerDuty incident when a Space sits in a build or error state past a deadline, with a Slack heads-up.
How it runs
The automated pipeline, trigger to output.
- Trigger: Schedule polls Space status
- Action: Fetch runtime stage for production Spaces (Hugging Face)
- Logic: Keep stuck-build or errored Spaces past deadline
- Action: Open PagerDuty incident per unhealthy Space (PagerDuty)
- Output: Post on-call Slack alert with incident link (Slack)
What it does
Watches the runtime status of your production-tagged Hugging Face Spaces. If a Space is stuck building, crash-looping, or in a runtime error for longer than the allowed window, it opens a PagerDuty incident and drops a Slack note so on-call sees it immediately.
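As a rough illustration of the incident step, the sketch below builds a trigger event shaped for the PagerDuty Events API v2. Whether the workflow uses the Events API or another PagerDuty integration is not stated here, so treat that as an assumption; the function name `build_pagerduty_event`, the `dedup_key` format, and the `critical` severity are hypothetical choices for illustration.

```python
def build_pagerduty_event(routing_key: str, space_id: str, stage: str) -> dict:
    """Build a trigger event body for one unhealthy Space.

    Assumption: the incident is opened through the PagerDuty Events API v2.
    The dedup_key format and severity below are illustrative choices.
    """
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # A stable key per Space so repeated polls update one incident
        # instead of opening a new one every few minutes.
        "dedup_key": f"hf-space:{space_id}",
        "payload": {
            "summary": f"Hugging Face Space {space_id} is in stage {stage}",
            "source": space_id,
            "severity": "critical",
        },
        "links": [
            {
                "href": f"https://huggingface.co/spaces/{space_id}",
                "text": "Space page",
            }
        ],
    }


event = build_pagerduty_event("YOUR_ROUTING_KEY", "acme/demo-app", "BUILD_ERROR")
print(event["payload"]["summary"])
```

Keying the event on the Space ID is one way to keep a Space that stays broken across several polls from paging on-call repeatedly.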
When to use it
When a Space backs a real user-facing feature and a silent build failure would otherwise go unnoticed until someone complains. This turns Space health into a paged signal.
How it works
1. A schedule polls Space status every few minutes.
2. The workflow fetches the runtime stage for each production-tagged Space via the Hugging Face API.
3. A filter keeps Spaces that are in BUILD_ERROR or RUNTIME_ERROR, or whose build has exceeded the max duration.
4. For each unhealthy Space, it opens a PagerDuty incident with the Space name and stage.
5. It posts a Slack message to the on-call channel linking the incident and the Space logs.
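The filter in step 3 can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the workflow's actual implementation: `SpaceStatus`, `stage_since`, the `BUILDING` stage name, and both default windows are hypothetical. In practice the stage could come from something like `HfApi().get_space_runtime(repo_id).stage` in `huggingface_hub`, with the poller itself recording when it first saw each stage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

ERROR_STAGES = {"BUILD_ERROR", "RUNTIME_ERROR"}
BUILDING_STAGE = "BUILDING"  # assumed name for an in-progress build


@dataclass(frozen=True)
class SpaceStatus:
    space_id: str
    stage: str
    stage_since: datetime  # hypothetical: when the poller first saw this stage


def unhealthy_spaces(
    statuses: list[SpaceStatus],
    now: datetime,
    max_build: timedelta = timedelta(minutes=30),   # illustrative default
    error_grace: timedelta = timedelta(minutes=5),  # illustrative default
) -> list[SpaceStatus]:
    """Return Spaces that have been errored or building past their window."""
    flagged = []
    for status in statuses:
        elapsed = now - status.stage_since
        if status.stage in ERROR_STAGES and elapsed > error_grace:
            flagged.append(status)
        elif status.stage == BUILDING_STAGE and elapsed > max_build:
            flagged.append(status)
    return flagged


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    SpaceStatus("acme/api", "RUNNING", now - timedelta(hours=3)),
    SpaceStatus("acme/demo", "BUILDING", now - timedelta(minutes=45)),
    SpaceStatus("acme/chat", "RUNTIME_ERROR", now - timedelta(minutes=10)),
    SpaceStatus("acme/new", "BUILDING", now - timedelta(minutes=5)),
]
for status in unhealthy_spaces(sample, now):
    print(status.space_id, status.stage)
```

Using a separate, shorter window for error stages avoids paging on a Space that errors briefly and recovers on its own, while still catching builds that never finish.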
Set it up
What you configure once, before turning it on.
1. Connect Hugging Face: models, datasets, and Spaces on the open-source hub.
2. Connect PagerDuty: incidents, on-call, escalations.
3. Connect Slack: channels, DMs, threads, mentions.
4. Set each agent's model: models are left unset so you pick the tier, either fast and cheap or top-quality.
5. Tune it to your data: edit the prompts, filters, and field mappings so it matches how your team works.
6. Test, then turn it on: run once against a sample, confirm the output, then enable the trigger.
