
Page on-call when a Hugging Face Space build is stuck or errored

Polls Hugging Face Space runtime status on a schedule and opens a PagerDuty incident when a Space sits in a build or error state past a deadline, with a Slack heads-up.

Category: DevOps
Engine: sim
Difficulty: intermediate
Trigger: schedule
Steps: 5
Setup: ~15 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: Schedule polls Space status
  • Action: Fetch runtime stage for production Spaces (Hugging Face)
  • Logic: Keep stuck-build or errored Spaces past deadline
  • Action: Open PagerDuty incident per unhealthy Space (PagerDuty)
  • Output: Post on-call Slack alert with incident link (Slack)

What it does

Watches the runtime status of your production-tagged Hugging Face Spaces. If a Space is stuck building, crash-looping, or in a runtime error for longer than the allowed window, it opens a PagerDuty incident and drops a Slack note so on-call sees it immediately.
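
The "allowed window" check can be sketched as a small pure function. This is a minimal illustration, not the workflow's actual implementation: the `SpaceStatus` record, the `stage_since` timestamp (assumed to be tracked by the poller across runs), and both default windows are hypothetical. The stage names BUILD_ERROR and RUNTIME_ERROR come from the steps on this page; BUILDING is assumed to be the stage reported while a build is in progress.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Stage names: BUILD_ERROR and RUNTIME_ERROR appear on this page;
# BUILDING is an assumption about what an in-progress build reports.
ERROR_STAGES = {"BUILD_ERROR", "RUNTIME_ERROR"}
BUILDING_STAGES = {"BUILDING"}


@dataclass
class SpaceStatus:
    """Hypothetical record the poller keeps for each Space."""
    space_id: str
    stage: str
    stage_since: datetime  # when the poller first observed the current stage


def is_unhealthy(
    status: SpaceStatus,
    now: datetime,
    max_build: timedelta = timedelta(minutes=20),   # assumed default
    error_grace: timedelta = timedelta(minutes=5),  # assumed default
) -> bool:
    """Return True if the Space has sat in a bad stage past its window."""
    age = now - status.stage_since
    if status.stage in ERROR_STAGES:
        return age > error_grace
    if status.stage in BUILDING_STAGES:
        return age > max_build
    return False


def unhealthy_spaces(statuses: list[SpaceStatus], now: datetime) -> list[SpaceStatus]:
    """Keep only the Spaces that should page on-call."""
    return [s for s in statuses if is_unhealthy(s, now)]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    sample = [
        SpaceStatus("acme/demo", "BUILDING", now - timedelta(minutes=45)),
        SpaceStatus("acme/search", "BUILDING", now - timedelta(minutes=3)),
        SpaceStatus("acme/chat", "RUNTIME_ERROR", now - timedelta(minutes=10)),
        SpaceStatus("acme/ok", "RUNNING", now - timedelta(days=2)),
    ]
    print([s.space_id for s in unhealthy_spaces(sample, now)])
    # prints ['acme/demo', 'acme/chat']
```

Using a separate, shorter grace period for error stages than for builds is a design choice in this sketch: a build is expected to take a while, whereas an error stage is already a failure and only needs enough delay to ignore brief flaps.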

When to use it

When a Space backs a real user-facing feature and a silent build failure would otherwise go unnoticed until someone complains. This turns Space health into a paged signal.

How it works

  1. A schedule polls Space status every few minutes.
  2. Fetch the runtime stage for each production-tagged Space via the Hugging Face API.
  3. A filter keeps Spaces that are in BUILD_ERROR or RUNTIME_ERROR, or whose build has exceeded the max duration.
  4. For each unhealthy Space, open a PagerDuty incident with the Space name and stage.
  5. Post a Slack message to the on-call channel linking the incident and the Space logs.
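
For readers who want to see what steps 4 and 5 might send, here is a rough sketch of the two payloads. It assumes the incident is raised through the PagerDuty Events API v2 and the note is posted as a simple Slack text message; the workflow's built-in connectors may do this differently. The function names, the `dedup_key` format, and the choice of `critical` severity are hypothetical, and the message links to the Space page (where logs can be opened) rather than to a specific logs URL.

```python
import json


def build_pagerduty_event(routing_key: str, space_id: str, stage: str) -> dict:
    """Build a 'trigger' event body for the PagerDuty Events API v2."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # A stable key per Space (hypothetical format) so repeated polls
        # update the same open alert instead of opening duplicates.
        "dedup_key": f"hf-space:{space_id}",
        "payload": {
            "summary": f"Hugging Face Space {space_id} is in stage {stage}",
            "source": space_id,
            "severity": "critical",  # assumed; pick what fits your policy
            "custom_details": {"stage": stage},
        },
    }


def build_slack_message(space_id: str, stage: str, incident_url: str) -> dict:
    """Build a plain-text Slack message linking the incident and the Space."""
    space_url = f"https://huggingface.co/spaces/{space_id}"
    text = (
        f":rotating_light: Space *{space_id}* is in `{stage}`. "
        f"<{incident_url}|PagerDuty incident> | <{space_url}|Open Space>"
    )
    return {"text": text}


if __name__ == "__main__":
    # Placeholder values for illustration only.
    event = build_pagerduty_event("YOUR_ROUTING_KEY", "acme/demo", "BUILD_ERROR")
    message = build_slack_message(
        "acme/demo", "BUILD_ERROR", "https://example.pagerduty.com/incidents/PXXXXXX"
    )
    print(json.dumps(event, indent=2))
    print(json.dumps(message, indent=2))
```

The sketch only builds the request bodies; sending them (and handling retries or rate limits) is left to whatever HTTP client or connector you use.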

Set it up

What you configure once, before turning it on.

  1. Connect Hugging Face
    Models, datasets, and Spaces on the open-source hub.
  2. Connect PagerDuty
    Incidents, on-call, escalations.
  3. Connect Slack
    Channels, DMs, threads, mentions.
  4. Set each agent's model
    We leave models unset so you pick the tier: fast and cheap, or top-quality.
  5. Tune it to your data
    Edit the prompts, filters, and field mappings so it matches how your team works.
  6. Test, then turn it on
    Run once against a sample, confirm the output, then enable the trigger.
