Frequently asked questions

Will a refusal layer make my agent feel slow or unhelpful?

Only if you skip the counter-proposal mode. Hard refusals do feel abrupt. Counter-proposals ("I can refund $500 now, or escalate the full $820 to a manager") usually score higher in user satisfaction than the original compliant action, because the user feels heard and the outcome is bounded.

Do I need this if I use a model with built-in safety training?

Yes. Model-level safety covers a narrow band of universal harms. It does not know your refund policy, your budget caps, your vendor allow-list, or your authority tiers. Those are your rules, and they belong in a layer you control.

How do I avoid the agent being jailbroken into bypassing the policy gate?

Keep the policy decision outside the model. The agent proposes an action; a non-LLM module decides allow or refuse; the tool layer obeys the module, not the agent. Prompt injection cannot rewrite Python that is not in the prompt.

Why a compliant agent is a risky agent

A fully compliant agent is one that executes any well-formed request from an authorized user. That sounds reasonable until you write out the business consequences. A compliant procurement agent will approve a duplicate invoice if asked nicely. A compliant support agent will issue a refund outside policy if the customer is persistent. A compliant ops agent will spin up infrastructure that blows the monthly budget because a developer typed the command.

The recent paper "Towards Responsibly Non-Compliant Machines" makes the argument plainly: autonomous agents need the engineered capacity to refuse, and refusal itself comes in many forms with different costs and different downstream effects. The operator stake is direct. Every refusal you do not design is a refusal your agent will improvise, or worse, skip.

The four business losses a missing refusal layer creates

Policy drift. Staff and customers learn what the agent will let them get away with, and that becomes the new floor.
Audit gaps. You cannot show a regulator the moments your system declined to act, because you never recorded them.
Escalation overload. Without graceful refusals, every edge case turns into a human ticket.
Reputational exposure. A bluntly worded "I cannot help with that" at the wrong moment costs accounts.

A taxonomy of refusal

Not all "no" answers are the same. Before you can engineer refusal, you need a vocabulary for it. Below is a working taxonomy that maps the modes discussed in the literature onto operator-visible behavior.

Refusal mode	What the agent does	When you want it	Operator cost if missing
Hard refusal	Declines the action, states the rule, takes no further step	Illegal requests, hard policy breaches	Regulatory and legal exposure
Soft refusal	Declines, offers an alternative path	Out-of-policy refunds, scope creep	Lost customers, agent does the wrong thing instead
Deferred refusal	Pauses, escalates to a human, holds state	Ambiguous high-value actions	Either premature action or stalled work
Counter-proposal	Suggests a modified action and asks for confirmation	Cost-overrun risk, partial authorization	Budget overruns, rework
Silent non-action	Does not perform, does not announce, logs internally	Spam, repeated identical requests, known abuse	Wasted compute, prompt-injection success
Conditional compliance	Performs, but with constraints (lower limit, dry-run)	Trust-but-verify cases, new users	All-or-nothing behavior in graded-trust scenarios

You do not need every row from day one. You do need to pick which rows your agent supports, label them, and route to them deterministically.
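
One way to make that routing deterministic is a plain lookup table from reason code to refusal mode, with a fixed fallback. The sketch below is illustrative only: the reason codes, the `ROUTES` table, the `SUPPORTED` set, and the choice of `DEFER` as the fallback are all assumptions, not part of any library or of the paper.

```python
from enum import Enum


class RefusalMode(Enum):
    HARD_REFUSE = "hard_refuse"
    SOFT_REFUSE = "soft_refuse"
    DEFER = "defer"
    COUNTER_PROPOSE = "counter_propose"
    SILENT = "silent_non_action"
    CONDITIONAL = "conditional"


# Hypothetical reason codes mapped to the mode that handles them.
ROUTES = {
    "ILLEGAL_REQUEST": RefusalMode.HARD_REFUSE,
    "REFUND_OVER_LIMIT": RefusalMode.COUNTER_PROPOSE,
    "AMBIGUOUS_HIGH_VALUE": RefusalMode.DEFER,
    "REPEATED_REQUEST": RefusalMode.SILENT,
}

# The rows this particular agent supports today (an assumed subset).
SUPPORTED = {
    RefusalMode.HARD_REFUSE,
    RefusalMode.DEFER,
    RefusalMode.COUNTER_PROPOSE,
}


def route(reason_code: str) -> RefusalMode:
    """Pick a refusal mode by table lookup.

    Falls back to DEFER when the code is unknown or when its mode
    is not supported yet, so the outcome never depends on the model.
    """
    mode = ROUTES.get(reason_code, RefusalMode.DEFER)
    return mode if mode in SUPPORTED else RefusalMode.DEFER
```

Because the same reason code always yields the same mode, a change in behavior can only come from a change to the table, which is easy to review.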

What the research actually argues

The core claim of the paper is that compliance is not a default to deviate from; it is one option among several, and "responsibly non-compliant" behavior has to be engineered with the same care as task performance. Three points are worth pulling out for operators.

First, refusal has a justification structure. An agent that says no without being able to explain why fails the same way a junior employee fails: nobody can tell if the refusal was correct, and nobody learns. Second, refusal interacts with authority. The same request from a finance director and a contractor should not produce the same outcome, and your agent has to know which is which. Third, refusal is a social act. How the agent declines shapes the user's next move. A flat refusal generates a workaround. A counter-proposal generates a corrected request.
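
The first and third points can be made concrete with a small rendering helper: it refuses to produce a message without an explanation, and it turns a counter-proposal into a question the user can answer. This is a rough sketch; the function name `render_refusal`, the mode strings, and the `kind` and `amount_usd` keys on the alternative are assumptions for illustration.

```python
from typing import Optional


def render_refusal(mode: str, explanation: str,
                   alternative: Optional[dict] = None) -> str:
    """Build the user-facing text for a refusal decision."""
    # A refusal with no stated reason cannot be reviewed or learned from.
    if not explanation.strip():
        raise ValueError("a refusal needs an explanation")

    # Counter-proposal: state the rule, then offer the bounded alternative.
    if mode == "counter_propose" and alternative is not None:
        offer = f"{alternative['kind']} of ${alternative['amount_usd']}"
        return f"{explanation} I can offer a {offer} instead. Shall I go ahead?"

    # Deferred refusal: tell the user where the request went.
    if mode == "defer":
        return f"{explanation} I have passed this to a human reviewer."

    # Hard and soft refusals fall back to the explanation itself.
    return explanation
```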

For more recent context on how agentic systems handle conflicting instructions and policy, see the survey-style discussions in arXiv listings on agent governance. The pattern is consistent: refusal is moving from a safety bolt-on to a core part of the agent loop.

Wiring refusal into the agent loop

Here is the practical part. You decide what your agent will not do, and you make that decision a first-class part of the execution flow, not a string in a system prompt.

flowchart TD
 A[User or upstream agent request] --> B[Intent + authority check]
 B --> C{Policy lookup}
 C -->|Allowed| D[Execute tool]
 C -->|Disallowed| E[Select refusal mode]
 C -->|Ambiguous| F[Escalate or counter-propose]
 E --> G[Log refusal event]
 F --> G
 D --> H[Log action event]
 G --> I[Weekly policy review]
 H --> I

The diagram says something simple: every request hits a policy lookup before it hits a tool, and every refusal is logged in the same place as every action. That single property, refusals as events, is what makes the system reviewable.
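
The flow in the diagram can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `policy_lookup` rule (refunds capped at 500 USD, a missing amount treated as ambiguous), the event field names, and the in-memory list standing in for a durable event store.

```python
from typing import Any, Callable


def policy_lookup(action: dict, actor: dict) -> str:
    """Hypothetical policy: refunds are capped at 500 USD."""
    if action.get("kind") != "refund":
        return "allowed"
    amount = action.get("amount_usd")
    if amount is None:
        return "ambiguous"  # cannot judge a refund with no amount
    return "allowed" if amount <= 500 else "disallowed"


def handle_request(action: dict, actor: dict,
                   tool: Callable[[dict], Any], log: list) -> dict:
    """Run the policy lookup before the tool; log either outcome."""
    verdict = policy_lookup(action, actor)

    if verdict == "allowed":
        result = tool(action)
        log.append({"event_type": "agent_action",
                    "actor": actor, "requested_action": action})
        return {"status": "executed", "result": result}

    # Disallowed requests are refused outright; ambiguous ones are deferred.
    mode = "hard_refuse" if verdict == "disallowed" else "defer"
    log.append({"event_type": "agent_refusal", "actor": actor,
                "requested_action": action, "mode": mode})
    return {"status": "refused", "mode": mode}
```

The tool is only ever called from inside `handle_request`, so the agent has no path to a tool that skips the lookup, and refusals land in the same log as actions.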

A minimal refusal policy in code

Below is a small policy module that an agent's tool layer can call before executing any action. It returns a structured decision instead of a free-text "sorry I cannot." The structure is what lets you measure and audit it.

# Policy gate: returns a decision the agent must obey before any tool call.
# Used by ops, support, and procurement agents alike.

from dataclasses import dataclass
from typing import Literal, Optional

RefusalMode = Literal[
    "allow", "hard_refuse", "soft_refuse",
    "defer", "counter_propose", "conditional"
]

@dataclass
class Decision:
    mode: RefusalMode
    reason_code: str
    explanation: str
    alternative: Optional[dict] = None

def evaluate(action: dict, actor: dict, context: dict) -> Decision:
    amount =

What this does for your business: every refusal carries a reason code, every reason code can be counted, and every counter-proposal carries an alternative you can A/B test. You are no longer guessing whether the agent is too strict or too lenient. You can see it.
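
The listing above breaks off partway through `evaluate`. As a rough illustration of how such a function might continue, here is a self-contained sketch. It is not the author's code: the `amount_usd` key is borrowed from the sample refusal event, while the `refund_limit_usd` context key, its default of 500, the `WITHIN_POLICY` reason code, and the choice of `counter_propose` for over-limit refunds are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    mode: str
    reason_code: str
    explanation: str
    alternative: Optional[dict] = None


def evaluate(action: dict, actor: dict, context: dict) -> Decision:
    """Assumed completion: cap refunds and counter-propose the cap."""
    # `actor` is accepted for interface parity but unused in this sketch.
    amount = action.get("amount_usd", 0)
    limit = context.get("refund_limit_usd", 500)  # assumed key and default

    if action.get("kind") == "refund" and amount > limit:
        return Decision(
            mode="counter_propose",
            reason_code="REFUND_OVER_LIMIT",
            explanation=f"Refunds above ${limit} need manager approval.",
            alternative={"kind": "refund", "amount_usd": limit},
        )

    return Decision(
        mode="allow",
        reason_code="WITHIN_POLICY",
        explanation="Request is within policy.",
    )


# Usage: an $820 refund request from a customer.
decision = evaluate(
    {"kind": "refund", "amount_usd": 820},
    {"id": "cust_8821", "role": "customer"},
    {},
)
print(decision.mode, decision.reason_code, decision.alternative)
```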

Logging refusals as business events

If the policy module is the brain, the event log is the memory. Treat refusal events with the same seriousness as transactions.

{
  "event_type": "agent_refusal",
  "timestamp": "2025-03-14T10:22:11Z",
  "agent": "support-tier1",
  "actor": {"id": "cust_8821", "role": "customer"},
  "requested_action": {"kind": "refund", "amount_usd": 820},
  "decision": {
    "mode": "defer",
    "reason_code": "REFUND_OVER_LIMIT",
    "alternative": {"kind": "refund", "amount_usd": 500}
  },
  "downstream": {"human_handoff_id": "tkt_44102"}
}

A weekly review of reason_code counts is the cheapest policy feedback loop you will ever build. If REFUND_OVER_LIMIT is firing 400 times a week, your refund limit is wrong or your customers have a real grievance. Either way, you learn.
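
A minimal version of that review fits in a few lines of standard-library Python. The sketch assumes events shaped like the sample above, with the reason code nested under `decision`; the function names and the threshold parameter are hypothetical.

```python
from collections import Counter


def reason_code_counts(events: list) -> Counter:
    """Count refusal events per reason code, ignoring other event types."""
    return Counter(
        event["decision"]["reason_code"]
        for event in events
        if event.get("event_type") == "agent_refusal"
    )


def hot_codes(counts: Counter, threshold: int) -> list:
    """Return reason codes that fired at least `threshold` times, busiest first."""
    return [code for code, n in counts.most_common() if n >= threshold]


# Usage with a tiny hand-made sample.
sample = [
    {"event_type": "agent_refusal", "decision": {"reason_code": "REFUND_OVER_LIMIT"}},
    {"event_type": "agent_refusal", "decision": {"reason_code": "REFUND_OVER_LIMIT"}},
    {"event_type": "agent_refusal", "decision": {"reason_code": "VENDOR_NOT_ALLOWED"}},
    {"event_type": "agent_action"},
]
counts = reason_code_counts(sample)
print(hot_codes(counts, threshold=2))
```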

Figure: Refusal events feeding the weekly policy review

Measuring refusal quality

Refusal rate is not a quality metric on its own. A 0% refusal rate means the agent is reckless. A 90% refusal rate means it is useless. The operator question is whether the refusals are the right ones.

Here are the metrics worth tracking from week one:

Refusal rate by reason code. Tells you which policies are load-bearing.
Refusal-to-escalation conversion. How often a refusal turns into a human-resolved ticket. Low numbers mean the agent is refusing things it should just do.
Refusal-to-retry conversion. How often a user re-asks after a refusal. High numbers mean your refusal explanations are not clear.
Counter-proposal acceptance rate. The single best signal that your agent is being helpful while declining.
Policy override rate. How often a human approver reverses a refusal. High rates mean the policy is too tight.

# Quick weekly report: refusal mix by reason code over the last 7 days.
# Run this as part of the operations review.
# Assumes each event also records a 0/1 downstream.retry flag, which the
# sample event above does not show.

duckdb -c "
SELECT decision.reason_code AS reason_code,
       COUNT(*) AS events,
       AVG(CASE WHEN downstream.human_handoff_id IS NOT NULL THEN 1 ELSE 0 END) AS handoff_rate,
       AVG(downstream.retry) AS retry_rate
FROM read_json_auto('agent_events/*.json')
WHERE event_type = 'agent_refusal'
  AND timestamp > now() - INTERVAL 7 DAY
GROUP BY decision.reason_code
ORDER BY events DESC;
"

What this gives you: a one-page Monday-morning view of where your agent said no last week, and whether those refusals were sticky or whether users worked around them.
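
Counter-proposal acceptance and policy override rate can be computed the same way from the event log. The sketch below assumes two extra boolean fields under `downstream`, `accepted` and `overridden`, which do not appear in the sample event; treat them as hypothetical names for whatever your logger records.

```python
from typing import Callable, Optional


def rate(events: list,
         selector: Callable[[dict], bool],
         outcome: Callable[[dict], bool]) -> Optional[float]:
    """Share of selected events for which `outcome` holds, or None if none selected."""
    selected = [e for e in events if selector(e)]
    if not selected:
        return None
    return sum(1 for e in selected if outcome(e)) / len(selected)


def is_refusal(event: dict) -> bool:
    return event.get("event_type") == "agent_refusal"


def counter_proposal_acceptance(events: list) -> Optional[float]:
    """How often a counter-proposal was accepted by the user."""
    return rate(
        events,
        lambda e: is_refusal(e) and e["decision"]["mode"] == "counter_propose",
        lambda e: bool(e.get("downstream", {}).get("accepted")),  # assumed field
    )


def policy_override_rate(events: list) -> Optional[float]:
    """How often a human approver reversed a refusal."""
    return rate(
        events,
        is_refusal,
        lambda e: bool(e.get("downstream", {}).get("overridden")),  # assumed field
    )
```

Returning `None` for an empty selection keeps a week with no counter-proposals from showing up as a misleading 0% acceptance rate.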

Governance, authority, and graded trust

A refusal layer collapses without a model of who is asking. The same request to wire funds means one thing from the CFO and another from a vendor portal. Three operator decisions follow:

Decide your authority tiers. Three is usually enough: low-trust (external), mid-trust (employees), high-trust (named approvers). Resist the urge to over-engineer.
Pin tools to tiers. Each tool, each amount band, and each data class gets a minimum tier. Below that, the policy gate refuses.
Make tier elevation explicit and time-boxed. Temporary elevations should expire, log, and require justification.
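
These three decisions translate into a small amount of code. The following is a sketch under stated assumptions: the tool names in `MIN_TIER`, the `Elevation` record, and the rule that unknown tools require the highest tier are illustrative choices, not a prescribed design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    LOW = 1   # external
    MID = 2   # employees
    HIGH = 3  # named approvers


# Hypothetical tools pinned to a minimum tier.
MIN_TIER = {
    "lookup_order": Tier.LOW,
    "issue_refund": Tier.MID,
    "wire_funds": Tier.HIGH,
}


@dataclass(frozen=True)
class Elevation:
    tier: Tier
    expires_at: datetime
    justification: str


def effective_tier(base: Tier, elevation: Optional[Elevation],
                   now: datetime) -> Tier:
    """Apply an elevation only if it is justified and has not expired."""
    if (elevation is not None and elevation.justification.strip()
            and now < elevation.expires_at):
        return max(base, elevation.tier)
    return base


def tier_allows(tool: str, base: Tier,
                elevation: Optional[Elevation] = None,
                now: Optional[datetime] = None) -> bool:
    """True if the actor's effective tier meets the tool's minimum."""
    now = now or datetime.now(timezone.utc)
    required = MIN_TIER.get(tool, Tier.HIGH)  # unknown tools get the strictest tier
    return effective_tier(base, elevation, now) >= required
```

A mid-trust employee with a one-hour elevation to `HIGH` passes the `wire_funds` check until `expires_at`, then drops back to the base tier with no cleanup job required.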

This is also where AI governance stops being a slide and starts being a control. An auditor does not want to read your prompt; they want to see the policy module, the event log, and the review cadence. Those three artifacts cover most internal control frameworks for automated decisioning.

A pragmatic rollout

You do not need to implement every refusal mode on day one. A realistic sequence for an operator team:

Week 1-2: list the top ten actions your agent will perform and the top five that it must refuse. Write them down.
Week 3-4: implement the policy gate with allow, hard_refuse, and defer only. Ship event logging.
Month 2: add counter_propose for the two highest-volume soft refusals. Measure acceptance.
Month 3: introduce authority tiers and conditional compliance. Begin weekly policy review using the reason-code report.
Month 4 onward: tune. Most policy changes will be moving thresholds, not adding rules.

The investment is small. The thing you are buying is the ability to honestly say, in a review or an incident, what your agent does when it should not do what it was asked.

Responsibly Non-Compliant Agents: A Refusal Design Guide