Datadog Spike Explainer with Runbook Citations in Slack

When a Datadog monitor fires, this workflow searches your runbook knowledge base for matching passages and posts a plain-English 'why this spiked' answer to Slack with linked Confluence citations.

Category: AI & RAG
Engine: sim
Difficulty: intermediate
Trigger: event
Steps: 6
Setup: ~15 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: Datadog monitor enters Alert state (Datadog)
  • Action: Fetch spiking metric, tags, and time-series window (Datadog)
  • Action: Embed alert context and search runbook vectors (Postgres)
  • Action: Draft grounded explanation from retrieved passages (OpenAI)
  • Logic: Branch on retrieval confidence (answer vs. escalate)
  • Output: Post explanation with Confluence citations to Slack (Slack)
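
The first two stages of the pipeline can be sketched as follows. This is a minimal illustration, not the workflow's actual implementation: Datadog webhook payloads are user-defined JSON templates, so the field names `transition`, `fired_at_ms`, and `query` are hypothetical, as is the function name and the window sizes. The returned `from`, `to`, and `query` keys are meant to match the parameters of Datadog's v1 timeseries query endpoint, assuming that is the endpoint used for the fetch.

```python
from typing import Optional


def build_query_window(
    payload: dict,
    lookback_s: int = 900,   # assumed: 15 minutes of history before the alert
    lookahead_s: int = 300,  # assumed: 5 minutes after the alert fired
) -> Optional[dict]:
    """Return time-series query parameters, or None if this is not an alert."""
    # Hypothetical field: the webhook template is assumed to map the monitor's
    # transition into "transition". Anything other than a fresh trigger is skipped.
    if payload.get("transition") != "Triggered":
        return None

    # Hypothetical field: event time in epoch milliseconds, converted to seconds.
    fired_at = int(payload["fired_at_ms"]) // 1000

    return {
        "from": fired_at - lookback_s,
        "to": fired_at + lookahead_s,
        "query": payload["query"],  # hypothetical field: the monitor's metric query
    }
```

Recoveries and renotifications return `None`, so only a fresh alert reaches the retrieval steps.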

What it does

Turns a raw Datadog alert into an answer. The moment a monitor crosses threshold, it pulls the spiking metric and recent context, searches your indexed runbooks for the most relevant fixes, and posts a concise explanation to the on-call Slack channel — every claim backed by a deep link to the source Confluence page.
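
As a rough sketch of the final message, the explanation and its citations could be assembled like this. The function names and message layout are assumptions for illustration. The `<url|label>` link syntax and the escaping of `&`, `<`, and `>` follow Slack's mrkdwn text formatting.

```python
def escape_mrkdwn(text: str) -> str:
    """Escape the three characters Slack treats as control characters."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def format_slack_message(summary: str, citations: list[tuple[str, str]]) -> str:
    """Build a message body: the explanation, then numbered source links.

    `citations` is a list of (page_title, confluence_url) pairs.
    """
    lines = [escape_mrkdwn(summary), "", "*Sources*"]
    for i, (title, url) in enumerate(citations, start=1):
        # Slack link syntax: <url|visible label>
        lines.append(f"{i}. <{url}|{escape_mrkdwn(title)}>")
    return "\n".join(lines)
```

Escaping the page title matters because a title containing `>` or `|`-adjacent markup could otherwise break the link.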

When to use it

Use it when on-call engineers waste the first ten minutes of every incident asking 'what does this alert even mean and where's the runbook.' Best for teams that already keep runbooks in Confluence and want the answer pushed to them instead of hunting for it.

How it works

  1. A Datadog monitor transitions to Alert and fires the webhook.
  2. The flow fetches the triggering metric, tags, and the surrounding time-series window from Datadog.
  3. It embeds the alert context and runs a similarity search over the runbook chunks stored in Postgres (pgvector).
  4. An LLM drafts a grounded explanation, citing only the retrieved passages.
  5. A confidence check decides whether to post a full answer or a 'no strong match, escalate' note.
  6. The answer, with Confluence citation links, lands in the incident Slack channel.
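
The similarity search and the confidence branch above can be sketched in plain Python. This is an in-memory stand-in, not the workflow's actual code: in Postgres with pgvector the same ranking would typically be an `ORDER BY embedding <=> $1 LIMIT k` query, where `<=>` is pgvector's cosine distance operator. The chunk shape, the function names, and the `0.75` threshold are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec: list[float], chunks: list[dict], k: int = 3) -> list[tuple[float, dict]]:
    """Return the k chunks most similar to the query as (score, chunk) pairs."""
    scored = [(cosine_similarity(query_vec, c["embedding"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def decide(scored: list[tuple[float, dict]], min_score: float = 0.75) -> str:
    """Branch on retrieval confidence: 'answer' or 'escalate'."""
    # Assumed rule: the best match must clear the threshold.
    if not scored or scored[0][0] < min_score:
        return "escalate"
    return "answer"
```

Gating on the top score alone is the simplest possible rule. A stricter variant might require several passages above the threshold before answering.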

Set it up

What you configure once, before turning it on.

  1. Connect Datadog: Metrics, traces, log search.
  2. Connect Postgres: Any Postgres URL (query, write, migrate).
  3. Connect OpenAI: Models, embeddings, files.
  4. Connect Confluence: Spaces, pages, blueprints.
  5. Connect Slack: Channels, DMs, threads, mentions.
  6. Set each agent's model: We leave models unset so you pick the tier, either fast and cheap or top-quality.
  7. Tune it to your data: Edit the prompts, filters, and field mappings so it matches how your team works.
  8. Test, then turn it on: Run once against a sample, confirm the output, then enable the trigger.
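
As one example of what tuning the prompts might involve, here is a hedged sketch of a prompt builder that restricts the model to the retrieved passages. The wording, the function name, and the passage fields (`title`, `url`, `text`) are all hypothetical. The template's real prompts may look quite different.

```python
def build_grounded_prompt(alert_summary: str, passages: list[dict]) -> str:
    """Assemble a prompt that asks for an explanation citing only given passages."""
    # Number each passage so the model can cite it as [1], [2], ...
    numbered = "\n\n".join(
        f"[{i}] {p['title']} ({p['url']})\n{p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "You are helping an on-call engineer understand a Datadog alert.\n"
        "Explain in plain English why the metric likely spiked.\n"
        "Use ONLY the runbook passages below and cite them as [n].\n"
        "If the passages do not explain the alert, say so explicitly.\n\n"
        f"Alert:\n{alert_summary}\n\n"
        f"Runbook passages:\n{numbered}\n"
    )
```

Numbering the passages gives each citation marker in the answer a direct mapping back to a Confluence URL when the Slack message is built.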
