
Frequently asked questions

Do I need a PhD on staff to use this?

No, but you need someone who can write the workflow spec carefully. That is closer to writing detailed acceptance criteria than to formal logic. The synthesizer handles the hard part. Expect to assign this to a senior engineer or a technical product manager, not a research scientist.

How does this differ from running evals on my agent system?

Evals measure how often your system succeeds on a sample of cases. Synthesis proves whether success is achievable across all cases consistent with your model. You want both: evals for content quality on individual steps, synthesis for coordination correctness across steps. Skipping synthesis is how you ship a system that passes 99% of evals and fails on the 1% that costs you the relationship.

What kinds of workflows are worth the investment?

The business question hiding inside "strategy synthesis"

Strategy synthesis is a formal term, but the operator version is simple: given a set of agents, a set of rules, and a goal, can the agents reach the goal no matter what the environment does, and what action should each agent take at each step?

That maps directly to decisions you already make:

Can my two pricing agents and one inventory agent guarantee margin stays above a floor, even if a competitor drops prices?
Can my support routing agent and refund agent together resolve a ticket within SLA without breaching the refund policy?
If a fraud-check agent disagrees with a payments agent, is there a coordination rule that still clears legitimate transactions on time?

The academic field calls the language for asking these questions Alternating-time Temporal Logic, or ATL. ATL lets you write statements like "this coalition of agents has a strategy to ensure the goal eventually holds." You do not need to read ATL formulas to benefit from the underlying idea: there is a precise way to ask whether your agent team can win, and a precise way to extract the playbook if they can.
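To make "can the team win, and what is the playbook" concrete, here is a minimal sketch of the classic fixpoint computation behind that question, run on a hypothetical five-minute toy game. The state names, the turn-based "team" versus "env" ownership, and the `winning_region` helper are all assumptions made for illustration; real ATL tooling works on richer concurrent models.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy game: each state is controlled by the agent team or by the environment.
OWNER: Dict[str, str] = {
    "start": "team", "ask": "env", "skip": "env",
    "approved": "team", "rejected": "team", "stuck": "env", "done": "team",
}
MOVES: Dict[str, List[str]] = {
    "start": ["ask", "skip"],
    "ask": ["approved", "rejected"],
    "skip": ["stuck"],
    "approved": ["done"],
    "rejected": ["done"],
    "stuck": ["stuck"],
    "done": ["done"],
}


def winning_region(
    owner: Dict[str, str], moves: Dict[str, List[str]], goal: Set[str]
) -> Tuple[Set[str], Dict[str, str]]:
    """Return the states from which the team can force the goal, plus a playbook."""
    win = set(goal)
    strategy: Dict[str, str] = {}
    changed = True
    while changed:  # repeat until no new winning state is found (a fixpoint)
        changed = False
        for state, successors in moves.items():
            if state in win:
                continue
            if owner[state] == "team":
                # The team needs only one move that stays inside the winning set.
                good = [s for s in successors if s in win]
                if good:
                    win.add(state)
                    strategy[state] = good[0]
                    changed = True
            elif successors and all(s in win for s in successors):
                # The environment must have no escape: every move leads to a win.
                win.add(state)
                changed = True
    return win, strategy


win, playbook = winning_region(OWNER, MOVES, {"done"})
print("team can guarantee the goal from start:", "start" in win)
print("playbook:", playbook)
```

The answer to the business question is the membership test on `start`; the playbook is the dictionary of moves. The loop visits every state repeatedly, which is exactly the cost that grows painful as the model gets larger.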

The catch, and the reason most teams never touch this, is that solving these questions exactly is expensive. The state space (every combination of what each agent knows and could do) blows up quickly. That is the gap the new work targets.
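A rough back-of-the-envelope sketch shows why. It assumes, purely for illustration, that every agent has the same number of local states and that the joint state is simply the combination of all of them; the figure of 20 local states is made up.

```python
def joint_state_count(local_states_per_agent: int, num_agents: int) -> int:
    """Joint states when every agent's local state can combine with every other's."""
    return local_states_per_agent ** num_agents


# Hypothetical figure: 20 local states per agent.
for agents in (2, 3, 5, 10):
    print(f"{agents} agents -> {joint_state_count(20, agents):,} joint states")
```

Two agents give 400 joint states; ten give roughly ten trillion. Exhaustive search has to account for all of them.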


Why symbolic alone is too slow and neural alone is too risky

There are two well-understood ways to attack strategy synthesis, and each fails in a way that operators feel directly.

The symbolic route

Symbolic methods (model checking, SAT solvers, fixpoint computation) explore the state space exhaustively. They give you guarantees: if they say a strategy exists, it does, and they hand you the strategy. The cost is time and memory. For non-trivial agent counts, they often do not finish in any window you would accept for a product decision.

In operator terms: you get a correct answer next quarter, for a question you needed answered before standup.

The neural route

Pure learning methods (reinforcement learning, policy networks) scale better. You train, you get a policy, you deploy. The cost is that the policy is a black box. It may work on the cases you trained on and silently fail on the ones you did not. There is no certificate that says "this agent team will not violate the refund policy."

In operator terms: you ship something fast, then your risk and compliance teams will not sign off, or worse, they sign off and you find out at 3am.

The neuro-symbolic compromise

The approach in the paper uses a neural network to guide the symbolic search. The neural component proposes candidate strategies or prunes branches that look unpromising; the symbolic component verifies. You keep the guarantee (because the symbolic layer is still the source of truth) and you cut the runtime (because the neural layer skips most of the dead ends).
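The division of labor can be sketched in a few lines. This is not the paper's algorithm: `heuristic_score` is a hand-written stand-in for a trained network, `verify` is a tiny exhaustive check standing in for the symbolic layer, and the three-verdict refund setting is a hypothetical toy. The point it illustrates is that guidance changes only the order of the search, never what counts as a valid answer.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Optional, Tuple

VERDICTS = ("approve", "reject", "uncertain")
ACTIONS = ("deny", "recheck", "payout")
Strategy = Dict[str, str]


def verify(strategy: Strategy) -> bool:
    """Exact check over every verdict: pay out on approve and in no other case."""
    return all((strategy[v] == "payout") == (v == "approve") for v in VERDICTS)


def heuristic_score(strategy: Strategy) -> int:
    """Stand-in for a learned model: a cheap guess at how promising a candidate is."""
    return int(strategy["approve"] == "payout") + int(strategy["reject"] == "deny")


def all_strategies() -> list:
    """Every mapping from fraud verdict to refund action (27 candidates)."""
    return [dict(zip(VERDICTS, choice)) for choice in product(ACTIONS, repeat=len(VERDICTS))]


def search(
    candidates: Iterable[Strategy],
    score: Optional[Callable[[Strategy], int]] = None,
) -> Tuple[Optional[Strategy], int]:
    """Return the first verified strategy and how many verifier calls it took."""
    ordered = sorted(candidates, key=score, reverse=True) if score else list(candidates)
    for calls, strategy in enumerate(ordered, start=1):
        if verify(strategy):  # the verifier, not the scorer, decides
            return strategy, calls
    return None, len(ordered)


unguided, unguided_calls = search(all_strategies())
guided, guided_calls = search(all_strategies(), score=heuristic_score)
print("verifier calls without guidance:", unguided_calls)
print("verifier calls with guidance:", guided_calls)
```

If the scorer is badly wrong, the search gets slower but the result is still verified, which is why the guarantee survives.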

Approach	Runtime on real workflows	Gives you a proof	Risk profile for production
Symbolic only (model checking)	Hours to days, often does not finish	Yes	Safe but unshippable
Neural only (learned policy)	Seconds at inference	No	Fast but unauditable
Neuro-symbolic	Minutes, varies with problem	Yes (verified)	Shippable and auditable
Manual rules (status quo)	Instant	Implicit only	Brittle, drifts with the business

The headline for an operator: this is the first family of techniques that is plausibly fast enough to use during a sprint and rigorous enough to put in front of an auditor.

Where this fits in an agent operating model

Most teams adopting agents today are at one of three maturity levels. The technique matters differently at each.

Single agent, human in the loop. You do not need strategy synthesis. You need good evals.
Multiple agents, scripted handoffs. You have written the coordination by hand. Strategy synthesis is a way to check that your handoff rules actually achieve the business goal under adversarial conditions.
Multiple agents, autonomous coordination. The agents decide who does what. Here you need synthesis, not just checking. You want the system to produce the coordination policy from the goal and the constraints.

The third case is where agentic organizations are heading, and where the cost of getting it wrong is largest. A claims operation that lets two agents negotiate a settlement should be able to prove, before deployment, that no sequence of interactions breaches the reserve policy.

flowchart LR
 G[Business goal in plain English] --> F[Formal spec: ATL formula]
 R[Policy constraints] --> F
 M[Agent capability model] --> S[Neuro-symbolic synthesizer]
 F --> S
 S -->|strategy exists| P[Verified per-agent policy]
 S -->|no strategy| X[Counterexample: which constraint blocks the goal]
 P --> D[Deploy to agent runtime]
 X --> E[Operator decision: relax constraint or change goal]

The counterexample branch is the part operators undervalue. When synthesis fails, it tells you exactly which constraint is impossible to satisfy alongside the goal. That is a business artifact: it is a memo to legal, or to the product owner, saying "we cannot promise both X and Y, pick one."

A worked example: a two-agent refund workflow

Consider a refund agent and a fraud-check agent. The goal: every legitimate refund completes within 60 seconds, and no fraudulent refund completes at all. The constraints: the fraud-check agent must approve before payout, the refund agent can re-query once if fraud-check returns "uncertain," and total queries to the fraud model are capped at two per ticket for cost reasons.

You would normally write this as procedural code and hope for the best. The synthesis framing makes the question explicit.

# Plain-English: describe the agents, their actions, and the goal,
# then ask the synthesizer whether a winning strategy exists.
from agenthive.synth import Agent, Game, synthesize

refund = Agent(
    name="refund",
    actions=["request_check", "payout", "deny", "wait"],
)
fraud = Agent(
    name="fraud",
    actions=["approve", "reject", "uncertain"],
)

game = Game(
    agents=[refund, fraud],
    constraints=[
        "fraud_calls <= 2",
        "payout implies fraud.last_action == approve"

If result.status is winning, you get a per-agent policy you can compile into your agent runtime. If it is no_strategy, you get a trace showing the case the system cannot handle: for example, "uncertain" followed by a second "uncertain" exhausts the query budget before resolution. That trace is what you take to the product owner.
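The "uncertain, uncertain" trace can be reproduced without any special tooling. The sketch below is a self-contained toy, not the synthesizer: it checks one hand-written refund policy against every possible sequence of fraud verdicts, with the two-query cap from the example. The function names and the trace format are assumptions for illustration.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

VERDICTS = ("approve", "reject", "uncertain")
MAX_FRAUD_CALLS = 2  # the per-ticket query cap from the example


def run_ticket(verdicts: Sequence[str]) -> Tuple[str, List[str]]:
    """Replay the refund policy against one sequence of fraud verdicts."""
    trace: List[str] = []
    for verdict in verdicts[:MAX_FRAUD_CALLS]:
        trace.append(f"request_check -> {verdict}")
        if verdict == "approve":
            trace.append("payout")  # payout only ever follows an approval
            return "resolved", trace
        if verdict == "reject":
            trace.append("deny")
            return "resolved", trace
        # "uncertain": loop around and re-query while budget remains
    return "unresolved", trace


def find_counterexample() -> Optional[List[str]]:
    """Try every verdict sequence; return the first trace that never resolves."""
    for verdicts in product(VERDICTS, repeat=MAX_FRAUD_CALLS):
        outcome, trace = run_ticket(verdicts)
        if outcome != "resolved":
            return trace
    return None


print(find_counterexample())
```

The returned trace is the artifact: two checks, two "uncertain" verdicts, budget spent, ticket still open. Either the cap moves, or the goal gains an escalation path.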

What this replaces in your current workflow

Today, the equivalent work is some mix of: a senior engineer reading the code, a QA team running scripted scenarios, and an incident every few months that reveals an interaction nobody modeled. Synthesis does not eliminate testing, but it does close the category of "we never thought of that combination" failures, because the symbolic search considers every combination your model allows, by construction.

What to put in your eval and governance stack

If you are evaluating whether to invest in this class of tooling, here is the shortlist of operational changes it implies.

Treat each agent workflow as a specification, not just code. Write the goal and the constraints in a form your tooling can read. This is the same discipline as writing acceptance criteria, just executable.
Run synthesis as part of CI for any workflow change. The runtime is now plausibly compatible with a pull-request gate.
Capture the counterexamples. They are the most useful artifact for product and risk reviews, more useful than test results, because they describe the boundary of what the system can promise.
Keep evals for things synthesis cannot handle: model quality on individual tasks, hallucination rates, tone. Synthesis verifies coordination, not content.

A minimal CI hook looks like this:

# Plain-English: on every change to an agent workflow, re-run synthesis
# and fail the build if the team can no longer guarantee the goal.
name: agent-workflow-verification
on:
  pull_request:
    paths:
      - "workflows/**/*.yaml"
      - "agents/**/*.py"
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run strategy synthesis
        run: |
          agenthive synth verify \
            --workflow workflows/refund.yaml \
            --method neuro_symbolic \
            --timeout 10m \
            --fail-on no_strategy
      - name: Upload counterexample if any
        if: failure()
        uses: actions/upload-artifact@v4
        with:

This is the operating model shift: verification of agent coordination becomes a build step, the same way unit tests became a build step a decade ago.

What it does not do

Three honest limits.

First, the technique is sensitive to how you model the world. If your agent capability model is wrong (you say the fraud-check agent always returns within 2 seconds when it sometimes takes 30), synthesis verifies the wrong system. The discipline of writing the model is itself the work; the tool does not invent it.
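A quick arithmetic sketch shows how much rides on that one number. It reuses the 60-second goal and the two-query cap from the refund example; the one-second payout time and the simple additive timing model are assumptions for illustration.

```python
SLA_SECONDS = 60       # goal from the refund example
MAX_FRAUD_CALLS = 2    # query cap from the refund example
PAYOUT_SECONDS = 1     # hypothetical time to execute the payout itself


def worst_case_seconds(fraud_latency_seconds: int) -> int:
    """Slowest path: every allowed fraud query is used, then the payout runs."""
    return MAX_FRAUD_CALLS * fraud_latency_seconds + PAYOUT_SECONDS


def meets_sla(fraud_latency_seconds: int) -> bool:
    return worst_case_seconds(fraud_latency_seconds) <= SLA_SECONDS


print("modeled latency 2s:", worst_case_seconds(2), "s, within SLA:", meets_sla(2))
print("real latency 30s:", worst_case_seconds(30), "s, within SLA:", meets_sla(30))
```

With the modeled 2-second latency the worst case is 5 seconds and the guarantee holds. With the real 30-second tail it is 61 seconds and the guarantee is false, even though the verified model said otherwise.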

Second, scale is still a real concern. The paper reports meaningful improvements over symbolic baselines, but problems with dozens of agents and rich state remain hard. For now, target this at the high-stakes coordination points (the three or four agents that touch money or compliance), not your entire org chart.

Third, the neural component needs training data or at least a representative game distribution. For brand new workflows, you may be in cold-start territory and fall back to slower symbolic search. Budget for that.

Neuro-Symbolic Strategy Synthesis for Multi-Agent Systems