
A new label-free signal checks whether subtask answers compose into the whole answer, outperforming self-consistency and semantic entropy on multi-step…
For a task with n subtasks, the check needs one whole-task call plus n subtask calls. If your agent already decomposes the task, the subtask calls are ones you are paying for anyway, so the only new cost is the single whole-task call. For small n (3 to 5), either way that is far cheaper than self-consistency at k=20, and it catches a different and more dangerous class of error.
No. The check is a wrapper around whatever model you already use. It works with closed-source models accessed by API. You only need to be able to call the model twice in two different ways: once for the whole, once for each part.
Chain-of-thought verification asks the model to check its own reasoning trace, often with a second prompt. That is closer to P(True): self-evaluation, which is known to be overconfident. Operadic consistency does not ask the model to evaluate itself. It runs the model in two structurally different ways and looks for a gap. The model is not the judge; the gap is.

Yes, and it is arguably a better fit there. A tool-calling agent that decomposes a request into tool calls already has an explicit decomposition. The composition rule is whatever the agent does with the tool outputs. You can run the whole request as a single direct query in a sandbox, run the tool chain as planned, and check that the two answers agree before taking any irreversible action.
Open-ended generation (marketing copy, brainstorming, draft writing) has no composition rule, so the check does not apply. Low-stakes internal use (summaries, notes) does not justify the extra calls. Keep the check for structured, high-stakes, decompositional workflows: that is where the cost of a confident wrong answer is highest and the signal is sharpest.
Most teams running AI agents in production have the same problem: the model is confidently wrong, and you find out from a customer. The usual confidence checks (asking the model how sure it is, sampling the same question many times, measuring how varied the answers are) all break down on multi-step tasks. A recent paper proposes a different signal, based on whether the parts of an answer actually fit together into the whole. This post walks through what it is, why it matters for operators, and how to wire it into an agent stack.
The source paper is Operadic consistency: a label-free signal for compositional reasoning failures in LLMs. I will use plain language throughout and translate the math into something an operations lead can act on.
If you run an agent that books travel, drafts contracts, reconciles invoices, or answers support tickets, the failure mode that hurts you is not the model saying "I don't know." It is the model returning a clean, well-formatted, wrong answer.
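Existing label-free confidence signals try to catch this: P(True) self-evaluation (asking the model how sure it is), self-consistency (sampling the same question many times and checking agreement), and semantic entropy (measuring how varied the sampled answers are).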
These work reasonably well on single-step questions. They degrade on multi-step reasoning, where the model takes a wrong turn early and then defends that turn consistently across samples. You get high confidence and a wrong answer. That is exactly the case where an operator most needs a warning.
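To make the failure mode concrete, here is a minimal sketch of a self-consistency score, assuming answers are compared as exact strings. The sample values are hypothetical and only illustrate the case where the model repeats the same early mistake in every sample.

```python
from collections import Counter


def self_consistency(samples):
    """Return (majority_answer, agreement), where agreement is the
    fraction of samples that match the majority answer."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


# Hypothetical samples: the model took the same wrong turn in all 20
# runs, so every sample agrees on the same (assumed wrong) total.
samples = ["120.0"] * 20
answer, agreement = self_consistency(samples)
print(answer, agreement)
```

Agreement is perfect here, yet nothing in the score says whether the shared answer is right. That is the gap the rest of this post is about.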

Operad theory is a branch of mathematics for describing how operations compose. Forget the formal definition. The operator-relevant idea is this: if a task decomposes into subtasks, then the answer to the whole task should be consistent with the answers to the parts, combined in the prescribed way.
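Concretely, for a reasoning question Q that breaks into steps S1, S2, S3: ask the model Q directly to get the whole answer, ask it S1, S2, and S3 separately and combine those answers by the task's composition rule, then compare the two results.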
If they agree, the model is internally consistent under composition. If they disagree, something broke, either in the whole or in a part. You do not need a ground-truth answer to detect the disagreement. The signal is the gap.
The paper formalizes this as a measure over an operad of subtask decompositions. The operator takeaway: it is a check that the model's part-answers and whole-answer are mutually compatible, and it is aimed at the cases self-consistency tends to miss, namely confident wrong reasoning chains.
Self-consistency asks: does the model agree with itself when asked the same question repeatedly? Operadic consistency asks: does the model agree with itself when asked at different levels of granularity? Those are different questions. A model can be stable across repeated samples of Q and still produce a whole answer that contradicts its own subtask answers. That contradiction is the signal.
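The idea can be sketched in a few lines, assuming numeric answers and a known composition rule. The function name `operadic_check`, the `fake_ask` stub, and the canned answers are all hypothetical; this illustrates the whole-versus-parts comparison and is not the paper's formal measure.

```python
from typing import Callable, Sequence, Tuple


def operadic_check(
    ask: Callable[[str], float],
    whole_question: str,
    sub_questions: Sequence[str],
    compose: Callable[[Sequence[float]], float],
    tolerance: float = 0.0,
) -> Tuple[bool, float]:
    """Ask the whole question and each subquestion, compose the parts,
    and return (consistent, gap). No ground truth is involved."""
    whole = ask(whole_question)
    parts = [ask(q) for q in sub_questions]
    gap = abs(whole - compose(parts))
    return gap <= tolerance, gap


# Stub standing in for a model: the parts sum to 12, but the direct
# answer to the whole question is 10.
canned = {"Q": 10.0, "S1": 3.0, "S2": 4.0, "S3": 5.0}


def fake_ask(question: str) -> float:
    return canned[question]


consistent, gap = operadic_check(fake_ask, "Q", ["S1", "S2", "S3"], sum, tolerance=0.01)
print(consistent, gap)
```

The check says nothing about which side is wrong. It only reports that the whole and the composed parts disagree.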
Here is a side-by-side for an operator. The numbers on cost are order-of-magnitude, based on what each method requires in extra model calls per question.
| Signal | Extra calls per question | Catches confident wrong chains | Needs task decomposition |
|---|---|---|---|
| P(True) self-evaluation | 1 | Poorly | No |
| Self-consistency (k=20) | 20 | Sometimes | No |
| Semantic entropy (k=10) | 10 | Sometimes | No |
| Operadic consistency | 2 to 5 | Yes | Yes |
The trade-off is clear. Operadic consistency is cheaper in calls but requires you to know how the task decomposes. For many production agent workflows, you already know that, because you wrote the prompt that asks for steps.
Let's take a concrete task: an invoice reconciliation agent that has to compute a total adjustment across three line items, then decide whether to flag the invoice for review.
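```python
# Run the whole task, then run each part, then check that
# the parts compose to the whole. No ground truth required.
import json

from my_llm import ask  # placeholder client: ask(prompt) -> str

invoice = load_invoice("INV-4471")  # placeholder loader for your invoice store

# 1. Whole answer
whole_prompt = f"""Given this invoice, compute the total adjustment
and decide flag/no-flag. Return JSON: {{adjustment, flag}}.
Invoice: {invoice}"""
whole = json.loads(ask(whole_prompt))

# 2. Part answers
parts = []
for item in invoice.line_items:
    part_prompt = f"Compute adjustment for this line item. Return a number.\nLine: {item}"
    parts.append(float(ask(part_prompt)))

# 3. Compose the parts (here: a sum) and compare with the whole
gap = abs(sum(parts) - float(whole["adjustment"]))
needs_review = gap > 0.01  # tolerance is an assumption; tune per workflow
```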
That snippet runs the whole task once, runs each subtask once, and checks that the parts add up to the whole. If they do not, you have a label-free warning that the model contradicted itself, and you route the invoice for human review.
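The check uses four calls on a three-item invoice: one whole-task call plus three line-item calls. If your agent already computes line items separately, only the whole-task call is new. Compare that to twenty calls for self-consistency, and you can see why this is attractive for high-volume workflows.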
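The call arithmetic is simple enough to write down. This sketch assumes that an agent which already decomposes the task pays for its subtask calls regardless, so only the whole-task call counts as new. The function name `check_calls` is made up for illustration.

```python
def check_calls(n_subtasks: int, already_decomposed: bool) -> int:
    """Model calls attributable to the consistency check for one question."""
    whole_call = 1
    # Subtask calls only count against the check if the agent was not
    # going to make them anyway.
    subtask_calls = 0 if already_decomposed else n_subtasks
    return whole_call + subtask_calls


SELF_CONSISTENCY_K = 20  # sample count used for comparison in this post

for n in (3, 4, 5):
    print(
        f"n={n}: standalone={check_calls(n, False)}, "
        f"on a decompositional agent={check_calls(n, True)}, "
        f"self-consistency={SELF_CONSISTENCY_K}"
    )
```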
Most production agent stacks already have three layers: the planner that decomposes the task, the executors that run subtasks, and an aggregator that returns the final answer. Operadic consistency sits naturally as a check between the executors and the aggregator.
```mermaid
flowchart TD
    A[User request] --> B[Planner: decomposes into S1..Sn]
    B --> C[Executors: answer each Si]
    B --> D[Direct executor: answer whole Q]
    C --> E[Compose parts by known rule]
    D --> F[Whole answer]
    E --> G{Consistent?}
    F --> G
    G -- Yes --> H[Return answer]
    G -- No --> I[Route to stronger model or human]
```

The branch on the right ("Direct executor: answer whole Q") is the only new piece. You are already paying for the left branch in any decompositional agent. The extra call is the cost of the check.
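As a rough sketch of how that wiring might look in code, the function below takes each layer as a callable. Every name here (`answer_with_gate`, `plan`, `execute_part`, `execute_whole`, `escalate`) is hypothetical, the answers are assumed to be numeric, and returning the composed answer on agreement is a design choice of this sketch rather than something the diagram prescribes.

```python
from typing import Callable, Sequence, Tuple


def answer_with_gate(
    request: str,
    plan: Callable[[str], Sequence[str]],
    execute_part: Callable[[str], float],
    execute_whole: Callable[[str], float],
    compose: Callable[[Sequence[float]], float],
    escalate: Callable[[str], float],
    tolerance: float = 0.01,
) -> Tuple[float, str]:
    # Existing branch: planner, executors, composition.
    composed = compose([execute_part(s) for s in plan(request)])
    # New branch: one direct call on the whole request.
    whole = execute_whole(request)
    if abs(composed - whole) <= tolerance:
        return composed, "returned"
    # Disagreement: hand off to a stronger model or a human queue.
    return escalate(request), "escalated"


# Stubs standing in for real components.
split_terms = lambda r: r.split("+")
agree = answer_with_gate("3+4", split_terms, float, lambda r: 7.0, sum, lambda r: -1.0)
disagree = answer_with_gate("3+4", split_terms, float, lambda r: 8.0, sum, lambda r: -1.0)
print(agree, disagree)
```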
You do not need to run this on every call. Useful triggers:
For internal, low-stakes calls (summarizing a meeting, drafting an internal note), the check is overhead you do not need.

The signal does not fix wrong answers. It tells you when not to trust them. That is the operator value: a cheap circuit breaker that lets you spend human review and stronger-model calls only where they are needed.
A reasonable rollout looks like this:
If you already run evals (test sets you score the agent against), operadic consistency gives you something different: a per-call signal in production, not a batch score on a fixed test set. That changes what eval-driven operations means. You stop relying only on offline batch scores and start running a cheap online check on every high-stakes call.
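One way to turn a per-call signal into something you can watch is a rolling inconsistency rate. The class below is a minimal sketch under that assumption. The name `InconsistencyMonitor` and the window size are illustrative, and alerting thresholds are left to your own monitoring setup.

```python
from collections import deque


class InconsistencyMonitor:
    """Tracks the share of recent checked calls that came back inconsistent."""

    def __init__(self, window: int = 100):
        # Only the most recent `window` results are kept.
        self._results = deque(maxlen=window)

    def record(self, consistent: bool) -> None:
        self._results.append(consistent)

    def rate(self) -> float:
        if not self._results:
            return 0.0
        return self._results.count(False) / len(self._results)


monitor = InconsistencyMonitor(window=4)
for outcome in (True, True, False, True):
    monitor.record(outcome)
first_rate = monitor.rate()

monitor.record(False)  # the oldest result drops out of the window
second_rate = monitor.rate()
print(first_rate, second_rate)
```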
A simple configuration sketch:
```yaml
# agent-quality-gates.yaml
# Cheap online checks applied per call, per workflow.
workflows:
  invoice_reconciliation:
    decomposition: per_line_item
    composition_rule: sum
    tolerance: 0.01
    action_on_inconsistency: route_to_reviewer
    sample_rate: 1.0
  support_triage:
    decomposition: per_subquestion
    composition_rule: logical_and
    action_on_inconsistency: retry_with_stronger_model
    sample_rate: 0.25
  internal_summary:
    enabled: false
```

That file is the operator's lever. It declares which workflows get the check, what to do when the check fires, and how often to apply it.
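A sketch of how an agent might consume that configuration follows. To stay self-contained it uses a plain dict mirroring the YAML rather than a YAML parser. The helper names (`should_check`, `gate`), the mapping of `composition_rule` values to Python functions, and the default of `enabled: true` when the key is absent are all assumptions of this example.

```python
import random

# Mirrors agent-quality-gates.yaml; in practice this would be loaded from the file.
CONFIG = {
    "invoice_reconciliation": {
        "composition_rule": "sum",
        "tolerance": 0.01,
        "action_on_inconsistency": "route_to_reviewer",
        "sample_rate": 1.0,
    },
    "support_triage": {
        "composition_rule": "logical_and",
        "action_on_inconsistency": "retry_with_stronger_model",
        "sample_rate": 0.25,
    },
    "internal_summary": {"enabled": False},
}

# Assumed mapping from rule names to composition functions.
COMPOSITION_RULES = {"sum": sum, "logical_and": all}


def should_check(workflow: str, rng: random.Random) -> bool:
    """Decide whether this call gets the check, honoring enabled and sample_rate."""
    cfg = CONFIG.get(workflow, {})
    if not cfg.get("enabled", True) or "composition_rule" not in cfg:
        return False
    return rng.random() < cfg.get("sample_rate", 1.0)


def gate(workflow: str, whole, parts) -> str:
    """Return "ok" or the configured action when parts and whole disagree."""
    cfg = CONFIG[workflow]
    composed = COMPOSITION_RULES[cfg["composition_rule"]](parts)
    if isinstance(composed, bool):
        consistent = composed == whole
    else:
        consistent = abs(composed - whole) <= cfg.get("tolerance", 0.0)
    return "ok" if consistent else cfg["action_on_inconsistency"]


rng = random.Random(0)
print(should_check("internal_summary", rng))
print(gate("invoice_reconciliation", 10.0, [3.0, 4.0, 5.0]))
print(gate("support_triage", True, [True, False]))
```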
Three honest limits:
These are real, and they are also tractable. The first one tells you where the check belongs (structured workflows) and where it does not (open-ended generation). The second one is why you still need periodic ground-truth evals. The third one is a prompt-engineering exercise you would do anyway.