BigQuery Regression LLM Root-Cause Explainer

When a cost spike is detected, this workflow sends the old and new query SQL plus job stats to an LLM, which explains in plain English why the query got more expensive and suggests a concrete fix.

Category: Data Ops
Engine: sim
Difficulty: advanced
Trigger: schedule
Steps: 5
Setup: ~25 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: Daily schedule
  • Action: Find largest regressor + job stats (Google BigQuery)
  • Action: Pull current and previous SQL (GitHub)
  • Action: LLM explains root cause + suggests fix (OpenAI)
  • Output: Send diagnosis to Slack

What it does

Turns a raw slot-hour spike into a human-readable root cause. It feeds the before/after SQL and BigQuery job statistics to an LLM, which explains the regression (e.g. a dropped partition filter or a new cross join) and proposes a fix.
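
As an illustration of what "feeding the before/after SQL and job statistics to an LLM" could look like, here is a minimal sketch that assembles a prompt. The function name `build_prompt`, the prompt wording, and the stat keys are assumptions made for this example, not the workflow's actual prompt. Including a unified diff alongside the full SQL is one design choice among several.

```python
import difflib


def build_prompt(old_sql: str, new_sql: str, stats: dict) -> str:
    """Assemble a root-cause prompt from before/after SQL and job stats.

    Hypothetical helper: the wording and layout are illustrative only.
    """
    # A unified diff makes small changes (e.g. a removed WHERE clause) easy to spot.
    diff = "\n".join(
        difflib.unified_diff(
            old_sql.splitlines(),
            new_sql.splitlines(),
            fromfile="previous.sql",
            tofile="current.sql",
            lineterm="",
        )
    )
    # Render job statistics as a stable, sorted bullet list.
    stat_lines = "\n".join(
        f"- {key}: {value}" for key, value in sorted(stats.items())
    )
    return (
        "A scheduled BigQuery query became more expensive.\n"
        "Explain the most likely root cause in plain English "
        "and suggest one concrete fix.\n\n"
        f"Job statistics:\n{stat_lines}\n\n"
        f"SQL diff (previous -> current):\n{diff}\n\n"
        f"Current SQL:\n{new_sql}\n"
    )


if __name__ == "__main__":
    previous = (
        "SELECT user_id, COUNT(*) AS n\n"
        "FROM events\n"
        "WHERE event_date = CURRENT_DATE()\n"
        "GROUP BY user_id"
    )
    current = "SELECT user_id, COUNT(*) AS n\nFROM events\nGROUP BY user_id"
    # Stat keys here are example values, not real measurements.
    print(build_prompt(previous, current, {"total_slot_ms": 5400000}))
```

In this example the dropped partition filter appears in the diff as a removed `WHERE` line, which is the kind of signal the LLM step would be expected to pick up.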

When to use it

When your team can detect cost spikes but loses time diagnosing *why* a query got more expensive. Use it to get a first-pass diagnosis attached to every regression alert.

How it works

  1. A scheduled trigger fires daily.
  2. A BigQuery query identifies the scheduled query with the largest slot-hour increase versus baseline, along with bytes scanned and stage timing.
  3. A GitHub action pulls the current and previous SQL for that query.
  4. An OpenAI step receives both SQL versions and the job stats and returns a root-cause explanation plus a suggested optimization.
  5. A Slack message delivers the spike metrics, the LLM diagnosis, and the proposed fix.
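
The selection in step 2 can be sketched in a few lines. This is a minimal illustration, assuming the per-query baseline and current slot-hours have already been computed and that "largest increase" means the largest absolute difference (the page does not say whether it is absolute or relative). `QueryStats`, `largest_regressor`, and the `min_increase` threshold are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class QueryStats:
    """Per-query figures; field names are assumptions for this sketch."""
    name: str
    baseline_slot_hours: float
    current_slot_hours: float
    bytes_scanned: int

    @property
    def increase(self) -> float:
        # Absolute slot-hour increase versus baseline.
        return self.current_slot_hours - self.baseline_slot_hours


def largest_regressor(
    rows: Iterable[QueryStats], min_increase: float = 0.0
) -> Optional[QueryStats]:
    """Return the query with the largest slot-hour increase, or None.

    Queries whose increase does not exceed `min_increase` are ignored,
    so a quiet day produces no alert.
    """
    regressed = [row for row in rows if row.increase > min_increase]
    return max(regressed, key=lambda row: row.increase, default=None)


if __name__ == "__main__":
    # Example values only, not real measurements.
    sample = [
        QueryStats("daily_revenue", 12.0, 48.5, 9_000_000_000),
        QueryStats("user_rollup", 30.0, 31.0, 2_000_000_000),
        QueryStats("cleanup", 5.0, 4.0, 100_000_000),
    ]
    worst = largest_regressor(sample)
    print(worst.name if worst else "no regression")
```

Switching to a relative increase would only require changing the `increase` property, at the cost of over-weighting very small queries.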

Set it up

What you configure once, before turning it on.

  1. Connect BigQuery: Datasets, queries, schemas.
  2. Connect GitHub: Repos, issues, pull requests, actions.
  3. Connect OpenAI: Models, embeddings, files.
  4. Connect Slack: Channels, DMs, threads, mentions.
  5. Set each agent's model: We leave models unset so you pick the tier: fast and cheap, or top quality.
  6. Tune it to your data: Edit the prompts, filters, and field mappings so it matches how your team works.
  7. Test, then turn it on: Run once against a sample, confirm the output, then enable the trigger.
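
For the test run in the last step, one way to confirm the output before enabling the trigger is to build the Slack message locally and inspect it. The sketch below is an assumption about how the message might be laid out. `format_slack_payload` is a hypothetical helper, and the only Slack convention relied on is that a message payload carries a `text` field with `*bold*` markup.

```python
import json


def format_slack_payload(
    query_name: str,
    baseline_slot_hours: float,
    current_slot_hours: float,
    diagnosis: str,
    proposed_fix: str,
) -> dict:
    """Build a Slack message payload with spike metrics, diagnosis, and fix.

    Hypothetical layout for illustration; adjust to your channel's conventions.
    """
    delta = current_slot_hours - baseline_slot_hours
    text = (
        f"*Cost regression: {query_name}*\n"
        f"Slot-hours: {baseline_slot_hours:.1f} -> {current_slot_hours:.1f} "
        f"({delta:+.1f})\n"
        f"*Diagnosis:* {diagnosis}\n"
        f"*Proposed fix:* {proposed_fix}"
    )
    return {"text": text}


if __name__ == "__main__":
    # Dry run: print the payload instead of posting it. Values are examples only.
    payload = format_slack_payload(
        "daily_revenue",
        12.0,
        48.5,
        "The partition filter on event_date was removed.",
        "Restore the WHERE event_date filter.",
    )
    print(json.dumps(payload, indent=2))
```

Printing the payload first keeps the sample run free of side effects; posting can be wired in once the wording looks right.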
