Pre-ingest PII screen on S3 CSV uploads with quarantine

Fires when a CSV lands in an S3 ingest bucket, classifies its columns with an LLM, and quarantines files that carry unexpected PII.

Category: Data Ops
Engine: sim
Difficulty: intermediate
Trigger: event
Steps: 6
Setup: ~15 min

How it runs

The automated pipeline, trigger to output.

  • Trigger: S3 object-created event on a new CSV upload (AWS S3)
  • Action: Read file header and sampled rows from S3 (AWS S3)
  • Action: Classify columns and detect unexpected PII with an LLM (OpenAI)
  • Logic: Decide clean versus quarantine by schema mismatch
  • Action: Copy offending file to S3 quarantine prefix (AWS S3)
  • Output: Alert data team in Slack with quarantine details (Slack)
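
The classification step could be driven by a prompt built from the header and sampled rows, with the model's reply validated before anything acts on it. The sketch below is one possible shape, not the workflow's actual prompt: the label set `PII_TYPES` and the function names `build_prompt` and `parse_reply` are assumptions made for illustration, and the call to the model itself is left out.

```python
import json

# Hypothetical label set; adjust to whatever categories your schema uses.
PII_TYPES = ("email", "phone", "name", "address", "government_id", "none")


def build_prompt(header, rows):
    """Build a classification prompt from the CSV header and sampled rows."""
    sample = [dict(zip(header, row)) for row in rows]
    return (
        "Classify each CSV column. Reply with a JSON object mapping every "
        f"column name to one of {list(PII_TYPES)}.\n"
        f"Columns: {json.dumps(header)}\n"
        f"Sample rows: {json.dumps(sample)}"
    )


def parse_reply(reply_text, header):
    """Map each column to a label from the model's JSON reply.

    Anything that cannot be validated (bad JSON, missing column,
    label outside PII_TYPES) becomes "unknown" so that a later step
    can treat it as suspicious rather than silently clean.
    """
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    result = {}
    for col in header:
        label = data.get(col)
        result[col] = label if label in PII_TYPES else "unknown"
    return result
```

Mapping unvalidated answers to `"unknown"` is a fail-closed choice: a malformed reply leads to a quarantine review rather than an unscreened load.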

What it does

This workflow stops sensitive data at the door. When a new CSV arrives in your S3 ingest bucket, it reads the header and a sample of rows, asks an LLM to classify each column, and compares the result against the dataset's declared schema. If columns carry PII that the schema did not expect, the file is copied to a quarantine prefix and held back from the warehouse, and the data team is notified.
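
The comparison against the declared schema can be sketched as a small pure function. This is a minimal illustration under assumptions: the classifier output is taken to be a mapping of column name to label where `"none"` means not sensitive, and the declared schema is reduced to the set of columns allowed to carry PII. The names `unexpected_pii` and `decide` are hypothetical.

```python
def unexpected_pii(classified, declared_pii):
    """Return columns flagged as sensitive that the schema did not declare.

    classified:   dict of column name -> label ("none" means not sensitive)
    declared_pii: iterable of column names the schema allows to hold PII
    """
    allowed = {c.lower() for c in declared_pii}
    return sorted(
        col
        for col, label in classified.items()
        if label != "none" and col.lower() not in allowed
    )


def decide(classified, declared_pii):
    """Decide clean versus quarantine from the schema mismatch."""
    offending = unexpected_pii(classified, declared_pii)
    return {
        "action": "quarantine" if offending else "clean",
        "offending_columns": offending,
    }
```

For example, with `{"order_id": "none", "email": "email", "customer_phone": "phone"}` and a schema that only declares `email`, the decision is to quarantine because of `customer_phone`.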

When to use it

Use it when external partners or internal teams drop files into a shared bucket and you cannot trust that every upload matches its agreed schema, so unexpected PII must be caught before ingestion.

How it works

  1. An S3 object-created event triggers on a new CSV in the ingest prefix.
  2. The workflow reads the file header and a sampled set of rows from S3.
  3. An OpenAI call classifies each column and identifies unexpected sensitive fields.
  4. A logic step decides clean versus quarantine based on the schema mismatch.
  5. If unexpected PII is found, the file is copied to the quarantine prefix in S3.
  6. A Slack alert reports the file, offending columns, and quarantine location.
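
The S3-facing parts of these steps can be sketched with the standard library alone. The code below assumes the standard S3 event notification layout (`Records[].s3.bucket.name`, `Records[].s3.object.key`, with URL-encoded keys); the prefixes `ingest/` and `quarantine/`, the function names, and the Slack message wording are all hypothetical. The actual S3 read, S3 copy, and Slack post are left out.

```python
import csv
import io
from itertools import islice
from urllib.parse import unquote_plus


def csv_objects(event, ingest_prefix="ingest/"):
    """Yield (bucket, key) for each newly created CSV under the ingest prefix."""
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 notifications ("+" stands for a space).
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(ingest_prefix) and key.lower().endswith(".csv"):
            yield bucket, key


def sample_csv(text, max_rows=20):
    """Return (header, rows) using at most max_rows data rows.

    If only the first bytes of the object were fetched, the last row
    may be cut off; callers may want to drop it.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    rows = list(islice(reader, max_rows))
    return header, rows


def quarantine_key(key, ingest_prefix="ingest/", quarantine_prefix="quarantine/"):
    """Map an ingest key to its destination under the quarantine prefix."""
    rest = key[len(ingest_prefix):] if key.startswith(ingest_prefix) else key
    return quarantine_prefix + rest


def slack_text(bucket, key, offending, dest_key):
    """Format the alert: file, offending columns, quarantine location."""
    return (
        f"Quarantined s3://{bucket}/{key}\n"
        f"Unexpected PII columns: {', '.join(offending)}\n"
        f"Copied to: s3://{bucket}/{dest_key}"
    )
```

Keeping these helpers free of network calls makes them easy to run against a sample event before the trigger is enabled.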

Set it up

What you configure once, before turning it on.

  1. Connect AWS S3: buckets, objects, signed URLs.
  2. Connect OpenAI: models, embeddings, files.
  3. Connect Slack: channels, DMs, threads, mentions.
  4. Set each agent's model. We leave models unset so you pick the tier: fast and cheap, or top-quality.
  5. Tune it to your data. Edit the prompts, filters, and field mappings so it matches how your team works.
  6. Test, then turn it on. Run once against a sample, confirm the output, then enable the trigger.
