Shell-Gated Bump with Benchmark Regression Guard

Beyond passing tests, the agent runs a benchmark in the sandboxed shell and compares the result to the recorded baseline.

Category: AI Agents
Engine: paperclip
Difficulty: advanced
Trigger: schedule
Steps: 5
Setup: ~25 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: Scheduled upgrade scan
  • Action: Run tests and benchmark for the bump in sandboxed shell (Shell)
  • Logic: Gate (tests pass and benchmark delta under threshold)
  • Action: Open GitLab MR with before/after benchmark table (GitLab)
  • Output: Return MR link with performance comparison (GitLab)
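The before/after benchmark table in the MR could be rendered along these lines. This is a minimal sketch, not the workflow's actual implementation: the function name `benchmark_table`, the row layout, and the choice of seconds as the unit are all assumptions made for illustration. It emits a Markdown table, which GitLab renders in MR descriptions.

```python
def benchmark_table(rows):
    """Render (name, before_seconds, after_seconds) rows as a Markdown table.

    Hypothetical helper: the row shape and column names are illustrative only.
    """
    lines = [
        "| Benchmark | Before (s) | After (s) | Delta |",
        "|---|---|---|---|",
    ]
    for name, before, after in rows:
        # Relative change against the baseline; positive means slower.
        delta = (after - before) / before
        lines.append(f"| {name} | {before:.3f} | {after:.3f} | {delta:+.1%} |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(benchmark_table([("parse_request", 2.0, 2.5), ("render", 4.0, 3.0)]))
```

Reporting the relative delta next to the raw timings lets a reviewer see at a glance whether the bump sits close to the gate's threshold.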

What it does

This agent gates dependency upgrades on two signals at once: correctness and speed. It runs the test suite and a benchmark in a sandboxed shell, then opens a GitLab MR only when tests pass and performance stays within an acceptable delta of the recorded baseline.

When to use it

Use it for performance-sensitive services where a silently slower dependency is as dangerous as a broken one. The benchmark guard catches regressions that green tests miss.

How it works

  1. A schedule launches the run.
  2. The agent pins one package upgrade and, in a sandboxed shell, runs both the test suite and the benchmark script, capturing timing numbers.
  3. A logic gate compares results: tests must pass and the benchmark delta must stay under the configured threshold.
  4. If either check fails, the run aborts with a logged reason and no MR.
  5. On a clean pass, the agent opens a GitLab MR embedding the before/after benchmark table.
  6. The MR link is returned as the final output.
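The gate in steps 3 and 4 can be sketched as a small pure function. This is a sketch under stated assumptions rather than the workflow's real logic: the names `GateResult` and `regression_gate`, the 5% default threshold, and the definition of the delta as relative slowdown against the baseline are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reason: str  # logged when the run aborts, or noted on a clean pass


def regression_gate(tests_passed: bool, baseline_s: float, candidate_s: float,
                    max_slowdown: float = 0.05) -> GateResult:
    """Pass only if tests are green and relative slowdown is under the threshold.

    Assumption: timings are in seconds and max_slowdown is a fraction (0.05 = 5%).
    """
    if not tests_passed:
        return GateResult(False, "test suite failed")
    if baseline_s <= 0:
        return GateResult(False, "invalid baseline timing")
    # Positive delta means the upgraded dependency is slower than the baseline.
    delta = (candidate_s - baseline_s) / baseline_s
    if delta >= max_slowdown:
        return GateResult(
            False, f"benchmark regressed by {delta:.1%} (limit {max_slowdown:.0%})"
        )
    return GateResult(True, f"benchmark delta {delta:+.1%} within limit")
```

Checking the tests first means a broken build is reported as a correctness failure, not as a timing problem, which keeps the logged abort reason unambiguous.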

Set it up

What you configure once, before turning it on.

  1. Connect Shell: Run sandboxed commands inside the workspace.
  2. Connect GitLab: Repos, MRs, pipelines, registry.
  3. Set each agent's model: Models are left unset so you can pick the tier, either fast and cheap or top-quality.
  4. Tune it to your data: Edit the prompts, filters, and field mappings so it matches how your team works.
  5. Test, then turn it on: Run once against a sample, confirm the output, then enable the trigger.
