
A fully compliant agent is a liability. Here is how to engineer structured refusal into autonomous agents, with a working taxonomy and operator guidance.
Only if you skip the counter-proposal mode. Hard refusals do feel abrupt. Counter-proposals ("I can refund $500 now, or escalate the full $820 to a manager") usually score higher in user satisfaction than the original compliant action, because the user feels heard and the outcome is bounded.
Yes. Model-level safety covers a narrow band of universal harms. It does not know your refund policy, your budget caps, your vendor allow-list, or your authority tiers. Those are your rules, and they belong in a layer you control.
Keep the policy decision outside the model. The agent proposes an action; a non-LLM module decides allow or refuse; the tool layer obeys the module, not the agent. Prompt injection cannot rewrite Python that is not in the prompt.

14-day trial. No DevOps. No Sales call. Provisioned in under a minute.
Refusal events are evidence of control operation. For most internal control frameworks, the question is not whether a policy exists, but whether it ran. A timestamped log of "agent declined this action, here is the reason code" is exactly that evidence, generated as a side effect of normal operation.
When announcing the refusal would itself create harm or noise: known spam, repeated identical requests after a clear refusal, or adversarial probing. Silent non-action still logs the event internally; it just does not respond to the requester. Use it sparingly, and document the exact conditions that trigger it.
If you are putting autonomous agents in front of customers, suppliers, or internal staff, the question is no longer "can it do the task." The question is what it does when the task is wrong, unsafe, off-policy, or simply not in your interest to perform. This post is a state-of-the-field walk-through for operators who have to make that call.
A fully compliant agent is one that executes any well-formed request from an authorized user. That sounds reasonable until you write out the business consequences. A compliant procurement agent will approve a duplicate invoice if asked nicely. A compliant support agent will issue a refund outside policy if the customer is persistent. A compliant ops agent will spin up infrastructure that blows the monthly budget because a developer typed the command.
The recent paper Towards Responsibly Non-Compliant Machines makes the argument plainly: autonomous agents need the engineered capacity to refuse, and refusal itself comes in many forms with different costs and different downstream effects. The operator stake is direct. Every refusal you do not design is a refusal your agent will improvise, or worse, skip.

Not all "no" answers are the same. Before you can engineer refusal, you need a vocabulary for it. Below is a working taxonomy that maps the modes discussed in the literature onto operator-visible behavior.
| Refusal mode | What the agent does | When you want it | Operator cost if missing |
|---|---|---|---|
| Hard refusal | Declines the action, states the rule, takes no further step | Illegal requests, hard policy breaches | Regulatory and legal exposure |
| Soft refusal | Declines, offers an alternative path | Out-of-policy refunds, scope creep | Lost customers, agent does the wrong thing instead |
| Deferred refusal | Pauses, escalates to a human, holds state | Ambiguous high-value actions | Either premature action or stalled work |
| Counter-proposal | Suggests a modified action and asks for confirmation | Cost-overrun risk, partial authorization | Budget overruns, rework |
| Silent non-action | Does not perform, does not announce, logs internally | Spam, repeated identical requests, known abuse | Wasted compute, prompt-injection success |
| Conditional compliance | Performs, but with constraints (lower limit, dry-run) | Trust-but-verify cases, new users | All-or-nothing behavior in graded-trust scenarios |
You do not need every row from day one. You do need to pick which rows your agent supports, label them, and route to them deterministically.
The core claim of the paper is that compliance is not a default to deviate from; it is one option among several, and "responsibly non-compliant" behavior has to be engineered with the same care as task performance. Three points are worth pulling out for operators.
First, refusal has a justification structure. An agent that says no without being able to explain why fails the same way a junior employee fails: nobody can tell if the refusal was correct, and nobody learns. Second, refusal interacts with authority. The same request from a finance director and a contractor should not produce the same outcome, and your agent has to know which is which. Third, refusal is a social act. How the agent declines shapes the user's next move. A flat refusal generates a workaround. A counter-proposal generates a corrected request.
For more recent context on how agentic systems handle conflicting instructions and policy, see the survey-style discussions in arXiv listings on agent governance. The pattern is consistent: refusal is moving from a safety bolt-on to a core part of the agent loop.
Here is the practical part. You decide what your agent will not do, and you make that decision a first-class part of the execution flow, not a string in a system prompt.
flowchart TD
A[User or upstream agent request] --> B[Intent + authority check]
B --> C{Policy lookup}
C -->|Allowed| D[Execute tool]
C -->|Disallowed| E[Select refusal mode]
C -->|Ambiguous| F[Escalate or counter-propose]
E --> G[Log refusal event]
F --> G
D --> H[Log action event]
G --> I[Weekly policy review]
H --> IThe diagram says something simple: every request hits a policy lookup before it hits a tool, and every refusal is logged in the same place as every action. That single property, refusals as events, is what makes the system reviewable.
Below is a small policy module that an agent's tool layer can call before executing any action. It returns a structured decision instead of a free-text "sorry I cannot." The structure is what lets you measure and audit it.
# Policy gate: returns a decision the agent must obey before any tool call.
# Used by ops, support, and procurement agents alike.
from dataclasses import dataclass
from typing import Literal, Optional
RefusalMode = Literal[
"allow", "hard_refuse", "soft_refuse",
"defer", "counter_propose", "conditional"
]
@dataclass
class Decision:
mode: RefusalMode
reason_code: str
explanation: str
alternative: Optional[dict] = None
def evaluate(action: dict, actor: dict, context: dict) -> Decision:
amount =
What this does for your business: every refusal carries a reason code, every reason code can be counted, and every counter-proposal carries an alternative you can A/B test. You are no longer guessing whether the agent is too strict or too lenient. You can see it.
If the policy module is the brain, the event log is the memory. Treat refusal events with the same seriousness as transactions.
{
"event_type": "agent_refusal",
"timestamp": "2025-03-14T10:22:11Z",
"agent": "support-tier1",
"actor": {"id": "cust_8821", "role": "customer"},
"requested_action": {"kind": "refund", "amount_usd": 820},
"decision": {
"mode": "defer",
"reason_code": "REFUND_OVER_LIMIT",
"alternative": {"kind": "refund", "amount_usd": 500}
},
"downstream": {"human_handoff_id": "tkt_44102"}
A weekly review of reason_code counts is the cheapest policy feedback loop you will ever build. If REFUND_OVER_LIMIT is firing 400 times a week, your refund limit is wrong or your customers have a real grievance. Either way, you learn.

Refusal rate is not a quality metric on its own. A 0% refusal rate means the agent is reckless. A 90% refusal rate means it is useless. The operator question is whether the refusals are the right ones.
Here are the metrics worth tracking from week one:
# Quick weekly report: refusal mix by reason code over the last 7 days.
# Run this as part of the operations review.
duckdb -c "
SELECT reason_code,
COUNT(*) AS events,
AVG(downstream.handoff) AS handoff_rate,
AVG(downstream.retry) AS retry_rate
FROM read_json_auto('agent_events/*.json')
WHERE event_type = 'agent_refusal'
AND timestamp > now() - INTERVAL 7 DAY
GROUP BY reason_code
ORDER BY events DESC;
"What this gives you: a one-page Monday-morning view of where your agent said no last week, and whether those refusals were sticky or whether users worked around them.
A refusal layer collapses without a model of who is asking. The same request to wire funds means one thing from the CFO and another from a vendor portal. Three operator decisions follow:
This is also where AI governance stops being a slide and starts being a control. An auditor does not want to read your prompt; they want to see the policy module, the event log, and the review cadence. Those three artifacts cover most internal control frameworks for automated decisioning.
You do not need to implement every refusal mode on day one. A realistic sequence for an operator team:
allow, hard_refuse, and defer only. Ship event logging.counter_propose for the two highest-volume soft refusals. Measure acceptance.The investment is small. The thing you are buying is the ability to honestly say, in a review or an incident, what your agent does when it should not do what it was asked.