
Frequently asked questions

Do I need formal methods experts on staff to do this?

For one or two workflows, no. You need someone who can translate business rules into a spec file, which is closer to writing detailed acceptance criteria than to writing proofs. Tooling does the heavy lifting. For a company-wide program, yes, you would want a small safety engineering function, similar in size to a security team.

How is this different from red-teaming or evals?

Red-teaming finds examples of failure. Evals measure failure rates on a fixed test set. Defensibility analysis tries to prove, against any input within stated assumptions, that certain rules hold; when it cannot prove a rule, it produces a concrete counterexample. The three are complementary. Defensibility is the strongest claim and the most expensive to produce.

What about agents that use large language models with open-ended outputs?

What a shield actually is, in plain English

A "shield" in this literature is a small program that sits between an agent and its environment. Before the agent's chosen action reaches the outside world, the shield checks it against a written specification: a list of things that must always be true (an invoice is never paid twice) or never be true (a customer record is never deleted without a ticket). If the action violates the spec, the shield blocks or rewrites it.

The mechanics come from automata theory. You write the spec in a formal language, typically a variant of temporal logic (a way to say "X must eventually happen" or "Y must never happen between event A and event B"). A compiler converts that spec into an automaton, a state machine that tracks the agent's history and decides, action by action, what is still allowed without breaking the rules.
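To make the state-machine idea concrete, here is a minimal hand-written sketch for a single rule, "an invoice is never paid twice." It is an illustration only: a real shield would be compiled from the formal spec rather than written by hand, and the class name `PayOnceShield` and the action names are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PayOnceShield:
    """Hand-written monitor for the rule 'an invoice is never paid twice'."""

    # The automaton's state: the invoices that have already been paid.
    paid: set = field(default_factory=set)

    def allowed(self, action: str, invoice_id: str) -> bool:
        """Return True if the action keeps the rule intact from the current state."""
        if action == "pay_invoice":
            return invoice_id not in self.paid
        return True  # Other actions are unconstrained by this rule.

    def step(self, action: str, invoice_id: str) -> bool:
        """Check one action; advance the state only if it is let through."""
        if not self.allowed(action, invoice_id):
            return False  # Blocked: the action never reaches the environment.
        if action == "pay_invoice":
            self.paid.add(invoice_id)
        return True


shield = PayOnceShield()
decisions = [
    shield.step("pay_invoice", "INV-1"),
    shield.step("read_invoice", "INV-1"),
    shield.step("pay_invoice", "INV-1"),  # second payment of the same invoice
    shield.step("pay_invoice", "INV-2"),
]
print(decisions)  # [True, True, False, True]
```

The state here is just a set of paid invoice IDs. A compiled automaton for a richer temporal spec tracks more history, but the loop is the same: look at the proposed action, consult the current state, allow or block, then update the state.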

Reinforcement learning (RL) is the training method where an agent learns by trial and error against a reward signal. "Shielded RL" means you train or run the agent inside this state machine so it cannot wander into forbidden territory. That is the standard story.

[Figure: a shield sitting between the agent and its environment]

The reframing: shields are reports, not runtime

The paper's claim is that operators rarely want the runtime shield itself. Running a verified automaton in front of every agent call is expensive, brittle, and hard to debug. What operators want is the evidence the synthesis process produces along the way.

When you compile a spec into a shield, the tool has to answer questions like:

Is there any sequence of adversarial inputs that forces the agent into a forbidden state?
Which states are "winning" (the agent can always recover) versus "losing" (the agent is doomed regardless of choice)?
How many of the forbidden behaviors does the current policy actually avoid on its own, without the shield intervening?

Those answers are a defensibility analysis. They tell you, before you deploy, what your agent stack can survive. The runtime shield is one possible output. The audit report is the more valuable one.
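The winning-versus-losing question can be sketched as a fixed-point computation over a small game graph. What follows is a textbook-style safety-game sketch under simplifying assumptions, not the paper's algorithm or any specific tool's implementation. The state names, action names, and the `winning_region` function are hypothetical.

```python
def winning_region(transitions, forbidden):
    """Greatest fixed point: states from which the agent can avoid `forbidden` forever.

    transitions[state][action] is the set of states the adversary may move to
    after the agent picks `action` in `state`.
    """
    winning = set(transitions) - set(forbidden)
    changed = True
    while changed:
        changed = False
        for state in sorted(winning):
            # The agent needs at least one action whose every outcome stays winning.
            has_safe_action = any(
                successors <= winning
                for successors in transitions[state].values()
            )
            if not has_safe_action:
                winning.discard(state)
                changed = True
    return winning


# Toy refund workflow; "bad" stands for any forbidden state.
game = {
    "idle": {"read_order": {"order_read"}, "issue_refund": {"bad"}},
    "order_read": {"issue_refund": {"idle"}, "wait": {"order_read", "pressured"}},
    "pressured": {"issue_refund": {"bad"}, "apologize": {"bad", "idle"}},
    "bad": {"noop": {"bad"}},
}

winning = winning_region(game, {"bad"})
losing = set(game) - winning - {"bad"}
print(sorted(winning))  # ['idle', 'order_read']
print(sorted(losing))   # ['pressured']
```

In this toy graph, `pressured` is losing because every action available there can be driven into `bad` by the adversary. A report would list it as a state the stack cannot survive, along with the input sequence that reaches it.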

Why this matters for buyers

If you are buying an agent product, you care about three numbers: how much risk it removes, how often it breaks normal workflow, and how confident you can be in both. Marketing copy gives you none of these. A defensibility report gives you all three, in a form you can show to a regulator or a board.

A comparison: runtime shield vs. offline defensibility report

Dimension	Runtime shield	Offline defensibility report	Standard guardrails (regex, classifiers)
When it runs	On every agent action	Once per release	On every agent action
Latency cost	Adds to each call	Zero at runtime	Adds to each call
Output for buyer	Blocked actions log	Evidence document with coverage numbers	Blocked actions log
Handles adversarial input	Yes, by construction	Yes, quantified	Only patterns it was trained on
Useful for procurement	Hard to audit	Designed to be audited	Limited
Engineering burden	High, ongoing	High once, then per release	Medium, ongoing
Failure mode	False blocks in production	Stale report between releases

The right answer is usually "both, but the report is what you sell." Run the shield in safety-critical paths where the latency is worth it. Ship the report to every buyer.

What goes in a defensibility report

A useful report has four parts. None of them require your buyer to understand temporal logic.

The specification. Plain-English rules ("a refund above $500 requires a human approver") with their formal translation attached as an appendix. The plain-English version is what the business signs off on.
Coverage. Of the rules you wrote, how many were proven to hold against any adversarial input? How many hold only under stated assumptions? How many could not be proven?
Counterexamples. For rules that did not hold, a concrete trace: "here is the sequence of three user messages that drives the agent into a forbidden state." This is the most useful artifact for engineering and the most damning for vendors who skip it.
Residual risk. A short list of behaviors the analysis cannot rule out, with proposed mitigations (a human-in-the-loop step, a rate limit, a runtime shield on that path only).
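The coverage part of such a report is simple arithmetic over per-rule outcomes. Here is a minimal sketch, assuming each rule carries one status string. The status labels (`proven`, `proven_under_assumptions`, `unproven`, `violated`) and the rule IDs are illustrative assumptions, not a standard vocabulary.

```python
from collections import Counter


def coverage_summary(rule_status):
    """Tally rules into the three buckets a coverage section asks about."""
    counts = Counter(rule_status.values())
    total = len(rule_status)
    proven = counts["proven"]
    conditional = counts["proven_under_assumptions"]
    return {
        "total": total,
        "proven": proven,
        "proven_under_assumptions": conditional,
        # Everything else: rules that were refuted or could not be decided.
        "not_proven": total - proven - conditional,
    }


# Hypothetical statuses for four illustrative rules.
summary = coverage_summary({
    "A": "proven",
    "B": "proven_under_assumptions",
    "C": "violated",
    "D": "unproven",
})
print(summary)
```

Keeping "proven under assumptions" separate from "proven" matters for the buyer: the first bucket is only as strong as the assumptions listed next to it.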

A worked example: refund agent

Suppose you run a refund agent. The business rules are simple: refunds above $500 need approval, no customer gets two refunds for the same order, and the agent cannot issue a refund without first reading the order record.

Here is what those rules look like in a specification file your safety team would write. The comments are the operator-readable version.

# Rules the refund agent must satisfy under any user input.
rules:
  - id: R1
    plain: "Refunds over $500 require human approval before issue."
    formal: "G ((issue_refund & amount > 500) -> once(approval_granted))"
  - id: R2
    plain: "An order can be refunded at most once."
    formal: "G (issue_refund(o) -> !once(issue_refund(o)))"
  - id: R3
    plain: "The agent must read the order before issuing a refund."
    formal: "G (issue_refund(o) -> once(read_order(o)))"

assumptions:
  - "Approval events are authentic (signed by approver service)."
  - "Order IDs are unique."

The formal lines use temporal operators: G means "always," and once means "at some earlier point," read here as strictly before the current step (otherwise R2 could never be satisfied). You do not need to read them. You need to know they exist so a tool can check them.
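For readers who want to see what "checking" these rules means on a single run, here is a minimal sketch that evaluates R1 to R3 over one recorded trace. It checks a trace, not all possible inputs, so it is a monitor rather than a proof. The event format (dictionaries with `event`, `order`, and `amount` keys) and the `check_trace` function are assumptions made for this example, and `once` is read as strictly earlier than the current step.

```python
def check_trace(trace):
    """Return the sorted ids of rules R1-R3 that the trace violates."""
    violated = set()
    approval_seen = False    # once(approval_granted)
    orders_read = set()      # once(read_order(o))
    orders_refunded = set()  # once(issue_refund(o)), strictly earlier steps only
    for event in trace:
        kind = event["event"]
        if kind == "approval_granted":
            approval_seen = True
        elif kind == "read_order":
            orders_read.add(event["order"])
        elif kind == "issue_refund":
            order = event["order"]
            if event["amount"] > 500 and not approval_seen:
                violated.add("R1")  # large refund with no earlier approval
            if order in orders_refunded:
                violated.add("R2")  # same order refunded before
            if order not in orders_read:
                violated.add("R3")  # refund without reading the order first
            orders_refunded.add(order)
    return sorted(violated)


clean = [
    {"event": "read_order", "order": 8821},
    {"event": "issue_refund", "order": 8821, "amount": 42},
]
skipped_read = [
    {"event": "issue_refund", "order": 8821, "amount": 42},
]
print(check_trace(clean))         # []
print(check_trace(skipped_read))  # ['R3']
```

As in the formal text of R1, any earlier approval event counts, whichever order it was granted for. A production spec would likely tie the approval to a specific refund.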

A synthesis tool then runs against your agent policy and produces something like this:

{
  "policy_version": "refund-agent-2024.11",
  "model": "gpt-x-2024-10",
  "rules": {
    "R1": {"status": "proven", "coverage": 1.0},
    "R2": {"status": "proven", "coverage": 1.0},
    "R3": {
      "status": "violated",
      "counterexample": [
        {"step": 1, "user": "I need a refund for order 8821, urgent"},
        {"step": 2, "agent": "issue_refund(8821, $42)"}
      ],
      "note": "Agent skipped read_order under time-pressure phrasing."
    }
  }
}

That JSON is the artifact a COO can hand to an auditor. It says, concretely, what the agent will and will not do, and where the gap is.

The pipeline, end to end

flowchart LR
 A[Business rules in plain English] --> B[Formal spec, temporal logic]
 B --> C[Synthesis tool]
 D[Agent policy + model version] --> C
 E[Adversarial test corpus] --> C
 C --> F[Defensibility report]
 C --> G[Optional runtime shield]
 F --> H[Procurement, audit, board]
 G --> I[Production safety-critical paths]

The point of this picture: one pipeline produces two artifacts. The report goes to people. The shield goes to production, but only where you need it. Most teams will get more value from the report than from the shield.

Running it in your build

If you want to wire this into a continuous integration (CI) job, the shape is familiar. Treat the defensibility report as a test artifact that blocks release if coverage drops.

# Run on every agent policy change. Fails CI if any rule regresses.
shield-synth \
 --spec specs/refund_agent.yaml \
 --policy policies/refund-agent-2024.11.json \
 --adversarial-corpus tests/redteam/ \
 --output reports/refund-agent-2024.11.json \
 --fail-on-regression --baseline reports/refund-agent-2024.10.json

The operator-facing meaning: a policy change that weakens a previously proven rule blocks the release the same way a failing unit test would. This is what "eval-driven operations" looks like when the evals are formal rather than statistical.
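The regression check itself is easy to reason about. Here is a minimal sketch of the comparison, assuming both reports have a top-level `rules` object mapping rule IDs to a `status` field. The `regressions` function is hypothetical and is not part of any particular tool.

```python
def regressions(baseline, current):
    """Rule ids that were proven in the baseline report but are not proven now."""
    lost = []
    for rule_id, old in baseline["rules"].items():
        new = current["rules"].get(rule_id, {})
        if old.get("status") == "proven" and new.get("status") != "proven":
            lost.append(rule_id)
    return sorted(lost)


baseline = {"rules": {
    "R1": {"status": "proven"},
    "R2": {"status": "proven"},
    "R3": {"status": "proven"},
}}
current = {"rules": {
    "R1": {"status": "proven"},
    "R2": {"status": "proven"},
    "R3": {"status": "violated"},
}}

lost = regressions(baseline, current)
exit_code = 1 if lost else 0  # a CI wrapper would exit with this code
print(lost, exit_code)  # ['R3'] 1
```

A rule that disappears from the new report counts as a regression too, because silently dropping a rule should not let a release through.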

Where this fits in the agent governance stack

There is now a recognizable stack forming around agent governance. Defensibility analysis sits between policy specification at the top and runtime enforcement at the bottom.

Policy layer. Business rules, written by operators and legal. Source of truth.
Specification layer. Formal translation of those rules. Owned by a safety engineer.
Defensibility layer. Synthesis tools that prove, refute, or quantify rule satisfaction against a given policy. Output: reports.
Runtime layer. Shields, monitors, classifiers. Catches what the layers above missed, or enforces rules where the cost is justified.
Incident layer. Logging, replay, post-mortem. Feeds back into specs.

Most current vendors sell at the runtime layer because it is the easiest to demo. The defensibility layer is where the procurement conversation is moving, because it is the layer that produces evidence rather than promises.

What operators should do this quarter

You do not need to adopt formal methods company-wide to act on this. A reasonable 90-day plan:

Pick one agent workflow with real money or real customer data attached. Refunds, account changes, ticket routing.
Write the rules in plain English. Get sign-off from the business owner and legal. This step alone usually surfaces disagreements worth knowing about.
Ask your agent vendor for a defensibility report against those rules. If they cannot produce one, that is a data point.
Run an internal adversarial test, even a small one. A hundred prompts written by your own team to try to break the rules.
Decide where, if anywhere, you need a runtime shield. Most workflows will not. A few will.

The goal is not to verify everything. It is to build the habit of asking for evidence in a form that can be audited, and to make that the buying standard.

Shield Synthesis as Defensibility Analysis for Agent…