
Shielded RL is usually pitched as a runtime guard. The same automata-theoretic machinery produces a more useful artifact: an offline defensibility audit.
For one or two workflows, no. You need someone who can translate business rules into a spec file, which is closer to writing detailed acceptance criteria than to writing proofs. Tooling does the heavy lifting. For a company-wide program, yes, you would want a small safety engineering function, similar in size to a security team.
Red-teaming finds examples of failure. Evals measure failure rates on a fixed test set. Defensibility analysis tries to prove, against any input within stated assumptions, that certain rules hold; when it cannot prove, it produces a concrete counterexample. The three are complementary. Defensibility is the strongest claim and the most expensive to produce.
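To make the "prove or produce a counterexample" idea concrete, here is a toy Python sketch. It stands in for real synthesis with a bounded exhaustive search: every conversation up to a fixed length is tried against a hypothetical `toy_policy`, and the first rule-breaking conversation is returned. The policy, the input alphabet, and the simplified rule (order IDs are ignored) are all assumptions made for illustration.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple


def toy_policy(message: str) -> List[str]:
    """Hypothetical agent: skips the order lookup when the user says 'urgent'."""
    if message == "urgent refund":
        return ["issue_refund"]
    if message == "refund":
        return ["read_order", "issue_refund"]
    return []


def read_before_refund(actions: Sequence[str]) -> bool:
    """Simplified rule: every issue_refund is preceded by a read_order."""
    seen_read = False
    for action in actions:
        if action == "read_order":
            seen_read = True
        elif action == "issue_refund" and not seen_read:
            return False
    return True


def find_counterexample(
    inputs: Sequence[str], max_turns: int
) -> Optional[Tuple[str, ...]]:
    """Try every input sequence up to max_turns; return the first violation."""
    for turns in range(1, max_turns + 1):
        for conversation in product(inputs, repeat=turns):
            actions = [a for msg in conversation for a in toy_policy(msg)]
            if not read_before_refund(actions):
                return conversation
    return None  # The rule holds for every conversation within the bound.


print(find_counterexample(["hello", "refund", "urgent refund"], max_turns=2))
```

Unlike a red-team run, a `None` result here is a claim about every conversation inside the bound, not just the ones someone thought to try. The bound and the input alphabet play the role of the "stated assumptions."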

The analysis applies to the agent's actions in the world (tool calls, database writes, payments), not to the text it generates. You specify rules over the action interface. This is usually the right boundary anyway: you care that the agent does not issue a duplicate refund, not that it phrases its response in a specific way.
If you treat the report as a CI artifact with a clear baseline, the marginal cost per release is small. The initial cost of writing the spec for a workflow is real, on the order of a sprint for a non-trivial agent. The payoff is that you stop arguing about whether a change is safe and start reading the report.
No. The space is early. Expect convergence over the next year as buyers in regulated industries start demanding a consistent artifact. Vendors who get ahead of this will have an easier time in enterprise procurement. Buyers who ask for it now will shape the format.
Most safety pitches for autonomous agents stop at "we wrap the model in guardrails." That is fine for marketing, but it does not answer the question a buyer actually asks: how much of my exposure does this remove, and can you prove it? A recent paper, Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks, argues that the formal machinery behind shielded reinforcement learning has been mis-marketed, and that the real product is an offline defensibility report. This post translates that argument for operators.
A "shield" in this literature is a small program that sits between an agent and its environment. Before the agent's chosen action reaches the outside world, the shield checks it against a written specification: a list of things that must always be true (an invoice is never paid twice) or never be true (a customer record is never deleted without a ticket). If the action violates the spec, the shield blocks or rewrites it.
The mechanics come from automata theory. You write the spec in a formal language, typically a variant of temporal logic (a way to say "X must eventually happen" or "Y must never happen between event A and event B"). A compiler converts that spec into an automaton, a state machine that tracks the agent's history and decides, action by action, what is still allowed without breaking the rules.
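As a rough illustration of the state-machine idea, here is a minimal Python sketch of a hand-written monitor for a single rule: an order must be read before it is refunded. The class name `ReadBeforeRefundMonitor` and the action names are hypothetical. A real toolchain would compile the automaton from the temporal-logic spec rather than have you write it by hand.

```python
from dataclasses import dataclass, field


@dataclass
class ReadBeforeRefundMonitor:
    """Tracks history for one rule: read an order before refunding it."""

    # Automaton state: the set of order IDs the agent has read so far.
    read_orders: set = field(default_factory=set)

    def allowed(self, action: str, order_id: int) -> bool:
        """Return True if taking this action now keeps the rule intact."""
        if action == "issue_refund":
            return order_id in self.read_orders
        return True  # Every other action is unconstrained by this rule.

    def step(self, action: str, order_id: int) -> None:
        """Advance the state after an action has actually been taken."""
        if not self.allowed(action, order_id):
            raise ValueError(f"{action}({order_id}) violates the rule")
        if action == "read_order":
            self.read_orders.add(order_id)


monitor = ReadBeforeRefundMonitor()
print(monitor.allowed("issue_refund", 8821))  # False: order not read yet
monitor.step("read_order", 8821)
print(monitor.allowed("issue_refund", 8821))  # True: history now permits it
```

The state here is just a set of order IDs. The useful property is that the decision depends on the agent's history, not only on the current action.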
Reinforcement learning (RL) is the training method where an agent learns by trial and error against a reward signal. "Shielded RL" means you train or run the agent inside this state machine so it cannot wander into forbidden territory. That is the standard story.
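One way to picture an agent running inside the state machine: the agent proposes actions in order of preference, and the shield passes through the first one the rules still permit. The sketch below assumes a hypothetical `is_allowed` predicate and a ranked action list. It illustrates the idea only and is not how any particular shielded RL library works.

```python
from typing import Callable, Optional, Sequence


def shielded_choice(
    ranked_actions: Sequence[str],
    is_allowed: Callable[[str], bool],
) -> Optional[str]:
    """Return the agent's most-preferred action that the shield permits."""
    for action in ranked_actions:
        if is_allowed(action):
            return action
    return None  # Nothing is safe: the caller should escalate to a human.


# Hypothetical history: the agent has not read the order yet.
history = ["greet_customer"]


def is_allowed(action: str) -> bool:
    # Refunds are only permitted once the order has been read.
    return action != "issue_refund" or "read_order" in history


# The agent prefers to refund immediately; the shield redirects it.
print(shielded_choice(["issue_refund", "read_order"], is_allowed))  # read_order
```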

The paper's claim is that operators rarely want the runtime shield itself. Running a verified automaton in front of every agent call is expensive, brittle, and hard to debug. What operators want is the evidence the synthesis process produces along the way.
When you compile a spec into a shield, the tool has to answer questions like:
Those answers are a defensibility analysis. They tell you, before you deploy, what your agent stack can survive. The runtime shield is one possible output. The audit report is the more valuable one.
If you are buying an agent product, you care about three numbers: how much risk it removes, how often it breaks normal workflow, and how confident you can be in both. Marketing copy gives you none of these. A defensibility report gives you all three, in a form you can show to a regulator or a board.
| Dimension | Runtime shield | Offline defensibility report | Standard guardrails (regex, classifiers) |
|---|---|---|---|
| When it runs | On every agent action | Once per release | On every agent action |
| Latency cost | Adds to each call | Zero at runtime | Adds to each call |
| Output for buyer | Blocked actions log | Evidence document with coverage numbers | Blocked actions log |
| Handles adversarial input | Yes, by construction | Yes, quantified | Only patterns it was written or trained for |
| Useful for procurement | Hard to audit | Designed to be audited | Limited |
| Engineering burden | High, ongoing | High once, then per release | Medium, ongoing |
| Failure mode | False blocks in production | Stale report between releases | Silent misses |
The right answer is usually "both, but the report is what you sell." Run the shield in safety-critical paths where the latency is worth it. Ship the report to every buyer.
A useful report has four parts. None of them require your buyer to understand temporal logic.

Suppose you run a refund agent. The business rules are simple: refunds above $500 need approval, no customer gets two refunds for the same order, and the agent cannot issue a refund without first reading the order record.
Here is what those rules look like in a specification file your safety team would write. The plain field on each rule is the operator-readable version.
```yaml
# Rules the refund agent must satisfy under any user input.
rules:
  - id: R1
    plain: "Refunds over $500 require human approval before issue."
    formal: "G (issue_refund & amount > 500 -> once(approval_granted))"
  - id: R2
    plain: "An order can be refunded at most once."
    formal: "G (issue_refund(o) -> !once(issue_refund(o)))"
  - id: R3
    plain: "The agent must read the order before issuing a refund."
    formal: "G (issue_refund(o) -> once(read_order(o)))"
assumptions:
  - "Approval events are authentic (signed by approver service)."
  - "Order IDs are unique."
```

The formal lines use temporal operators: G means "always," once means "at some point in the past." You do not need to read them. You need to know they exist so a tool can check them.
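To make the two operators concrete, here is a small Python sketch that checks rules R2 and R3 over a finite trace of events. It assumes a hypothetical trace format of `(action, order_id)` tuples and reads once as "at some strictly earlier step," matching the gloss above. Real tools work on automata rather than replaying lists, so treat this as an illustration of the semantics only.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # (action name, order ID); a hypothetical trace format


def once(trace: List[Event], i: int, event: Event) -> bool:
    """True if `event` occurred at some step strictly before step i."""
    return event in trace[:i]


def holds_r2(trace: List[Event]) -> bool:
    """G (issue_refund(o) -> !once(issue_refund(o))), checked at every step."""
    return all(
        not once(trace, i, ("issue_refund", order))
        for i, (action, order) in enumerate(trace)
        if action == "issue_refund"
    )


def holds_r3(trace: List[Event]) -> bool:
    """G (issue_refund(o) -> once(read_order(o))), checked at every step."""
    return all(
        once(trace, i, ("read_order", order))
        for i, (action, order) in enumerate(trace)
        if action == "issue_refund"
    )


good = [("read_order", 8821), ("issue_refund", 8821)]
bad = [("issue_refund", 8821)]
print(holds_r3(good))  # True
print(holds_r3(bad))   # False
```

The `all(...)` over every step plays the role of G, and the slice `trace[:i]` plays the role of once.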
A synthesis tool then runs against your agent policy and produces something like this:
```json
{
  "policy_version": "refund-agent-2024.11",
  "model": "gpt-x-2024-10",
  "rules": {
    "R1": {"status": "proven", "coverage": 1.0},
    "R2": {"status": "proven", "coverage": 1.0},
    "R3": {
      "status": "violated",
      "counterexample": [
        {"step": 1, "user": "I need a refund for order 8821, urgent"},
        {"step": 2, "agent": "issue_refund(8821, $42)"}
      ],
      "note": "Agent skipped read_order under time-pressure phrasing."
    }
  }
}
```
That JSON is the artifact a COO can hand to an auditor. It says, concretely, what the agent will and will not do, and where the gap is.
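A report in this shape is also easy to summarize for a non-technical reader. The Python sketch below assumes the field names from the example above (`rules`, `status`, `note`) and works on an abbreviated inline copy of it. A real tool's schema may differ.

```python
import json

# Abbreviated copy of the example report; field names are assumed from it.
REPORT = """
{
  "policy_version": "refund-agent-2024.11",
  "rules": {
    "R1": {"status": "proven", "coverage": 1.0},
    "R2": {"status": "proven", "coverage": 1.0},
    "R3": {"status": "violated",
           "note": "Agent skipped read_order under time-pressure phrasing."}
  }
}
"""


def summarize(report: dict) -> list:
    """Return one plain-English line per rule, non-proven rules first."""
    lines = []
    for rule_id, result in sorted(
        report["rules"].items(),
        key=lambda item: item[1]["status"] == "proven",
    ):
        line = f"{rule_id}: {result['status']}"
        if "note" in result:
            line += f" ({result['note']})"
        lines.append(line)
    return lines


for line in summarize(json.loads(REPORT)):
    print(line)
```

Sorting the non-proven rules to the top is a small design choice that matches how the report gets read: the gap is the first thing an auditor wants to see.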
```mermaid
flowchart LR
    A[Business rules in plain English] --> B[Formal spec, temporal logic]
    B --> C[Synthesis tool]
    D[Agent policy + model version] --> C
    E[Adversarial test corpus] --> C
    C --> F[Defensibility report]
    C --> G[Optional runtime shield]
    F --> H[Procurement, audit, board]
    G --> I[Production safety-critical paths]
```

The point of this picture: one pipeline produces two artifacts. The report goes to people. The shield goes to production, but only where you need it. Most teams will get more value from the report than from the shield.
If you want to wire this into a continuous integration (CI) job, the shape is familiar. Treat the defensibility report as a test artifact that blocks release if coverage drops.
```bash
# Run on every agent policy change. Fails CI if any rule regresses.
shield-synth \
  --spec specs/refund_agent.yaml \
  --policy policies/refund-agent-2024.11.json \
  --adversarial-corpus tests/redteam/ \
  --output reports/refund-agent-2024.11.json \
  --fail-on-regression --baseline reports/refund-agent-2024.10.json
```

The operator-facing meaning: a policy change that weakens a previously proven rule blocks the release the same way a failing unit test would. This is what "eval-driven operations" looks like when the evals are formal rather than statistical.
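The regression check itself is simple to reason about. As a sketch of what a flag like `--fail-on-regression` might do (the tool's actual behavior is not specified here, so treat this as an assumption), the Python below compares two reports and lists every rule that was proven in the baseline but is not proven in the new one. The report dictionaries are hypothetical and reduced to the fields the check needs.

```python
def regressions(baseline: dict, current: dict) -> list:
    """Rule IDs proven in the baseline but not proven in the current report."""
    return sorted(
        rule_id
        for rule_id, old in baseline["rules"].items()
        if old["status"] == "proven"
        and current["rules"].get(rule_id, {}).get("status") != "proven"
    )


# Hypothetical reports, reduced to the fields the check needs.
baseline = {"rules": {"R1": {"status": "proven"}, "R3": {"status": "proven"}}}
current = {"rules": {"R1": {"status": "proven"}, "R3": {"status": "violated"}}}

failed = regressions(baseline, current)
if failed:
    # A CI job would exit non-zero at this point to block the release.
    print(f"Release blocked, regressed rules: {', '.join(failed)}")
```

A rule that disappears from the new report counts as a regression in this sketch, on the assumption that silently dropping a rule should not pass the gate.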
There is now a recognizable stack forming around agent governance. Defensibility analysis sits between policy specification at the top and runtime enforcement at the bottom.
Most current vendors sell at the runtime layer because it is the easiest to demo. The defensibility layer is where the procurement conversation is moving, because it is the layer that produces evidence rather than promises.
You do not need to adopt formal methods company-wide to act on this. A reasonable 90-day plan:
The goal is not to verify everything. It is to build the habit of asking for evidence in a form that can be audited, and to make that the buying standard.