
Datadog runaway log volume spike alert

Watches Datadog log ingestion per service on a short interval and pages the owning team in Slack when a service's volume jumps far above its recent baseline.

Category: Other
Engine: sim
Difficulty: intermediate
Trigger: schedule
Steps: 5
Setup: ~15 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: Short-interval schedule (every 15 min)
  • Action: Query log volume per service (Datadog)
  • Logic: Flag services exceeding baseline multiplier
  • Logic: Resolve owning team for flagged services
  • Output: Alert owning team in Slack

What it does

This workflow catches log floods before they blow the budget. It samples Datadog log-ingestion volume per service, compares each service to its own trailing baseline, and fires a targeted alert when a service starts emitting dramatically more logs than usual — the classic symptom of a debug flag left on or a retry storm.
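The baseline comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_spikes`, the use of a median over previous windows as the "trailing baseline", and the default multiplier of 5 are all hypothetical choices, not the workflow's actual logic or defaults.

```python
from statistics import median

def find_spikes(history, current, multiplier=5.0, min_baseline=1.0):
    """Return {service: ratio} for services far above their own baseline.

    history: service -> list of log counts from previous windows (assumed shape)
    current: service -> log count for the latest window (assumed shape)
    """
    spikes = {}
    for service, count in current.items():
        past = history.get(service, [])
        if not past:
            continue  # no baseline yet, so skip rather than guess
        # Median is an assumed baseline; the floor avoids dividing by ~0
        baseline = max(median(past), min_baseline)
        ratio = count / baseline
        if ratio >= multiplier:
            spikes[service] = round(ratio, 1)
    return spikes

history = {"checkout": [1000, 1100, 900, 1000], "search": [500, 520, 480, 510]}
current = {"checkout": 12000, "search": 530}
print(find_spikes(history, current))  # {'checkout': 12.0}
```

A median is used here because a single earlier spike in the history would inflate a mean and mask the next one; a mean or percentile would work with the same structure.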

When to use it

Use it when a single noisy deploy can quietly 10x your log bill overnight and you want to catch the spike within minutes, not on next month's invoice. Best for teams with many services and tag-based ownership.

How it works

  1. A short-interval schedule (e.g. every 15 minutes) triggers the check.
  2. The Datadog action queries indexed log volume grouped by `service` over the recent window.
  3. A logic step computes each service's trailing baseline and flags any service whose current rate exceeds its threshold multiplier.
  4. For flagged services it resolves the owning team from tags.
  5. The output step posts a per-service spike alert to the owning team's Slack channel with the volume, multiplier, and a Datadog query link.
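Steps 4 and 5 might look roughly like the sketch below, which resolves an owner from tags and assembles the alert payload without calling any API. The `team:` tag key, the channel mapping, the fallback channel, and the helper names are assumptions for illustration; the link uses the Logs Explorer URL on `app.datadoghq.com`, which will differ for other Datadog sites.

```python
from urllib.parse import quote

def resolve_team(tags, default="unowned"):
    """Pick the owner from 'key:value' tags, e.g. 'team:payments' (assumed key)."""
    for tag in tags:
        key, _, value = tag.partition(":")
        if key == "team" and value:
            return value
    return default

def build_alert(service, volume, ratio, tags, channels, fallback="#log-alerts"):
    """Build a Slack-style message dict; sending it is left to the Slack step."""
    team = resolve_team(tags)
    channel = channels.get(team, fallback)  # hypothetical team -> channel map
    # Assumed Datadog site; the query is URL-encoded so ':' becomes %3A
    link = "https://app.datadoghq.com/logs?query=" + quote(f"service:{service}")
    text = (
        f"Log volume spike: {service} emitted {volume:,} logs "
        f"in the last window ({ratio}x baseline). {link}"
    )
    return {"channel": channel, "text": text}

alert = build_alert(
    "checkout", 12000, 12.0,
    tags=["env:prod", "team:payments"],
    channels={"payments": "#payments-oncall"},
)
print(alert["channel"])
print(alert["text"])
```

Routing services with no `team` tag to a fallback channel keeps a spike from going unreported just because ownership metadata is missing.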

Set it up

What you configure once, before turning it on.

  1. Connect Datadog: metrics, traces, log search.
  2. Connect Slack: channels, DMs, threads, mentions.
  3. Set each agent's model: models are left unset so you pick the tier (fast and cheap, or top-quality).
  4. Tune it to your data: edit the prompts, filters, and field mappings so it matches how your team works.
  5. Test, then turn it on: run once against a sample, confirm the output, then enable the trigger.
