
Frequently asked questions

Do I need reinforcement learning to use this idea?

No. The research uses reinforcement learning because that is the formal setting, but the operator pattern (a shared preference vector, joint evaluation, and a profile-driven runtime) works for prompt-based agents, tool-using agents, and classical workflow systems. Reinforcement learning helps when you have enough volume and signal to train policies that generalize across preferences automatically.

How many objectives can a team realistically handle?

In production today, two to four. Past that, the preferences become hard to articulate, hard to instrument, and hard to govern. If you have eight objectives on paper, you have two or three real ones and a list of constraints. Treat the constraints as hard limits, not weights.
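One way to read "hard limits, not weights" in code: remove any option that breaks a constraint before scoring happens, then weigh only the real objectives. This is a minimal sketch; the supplier data, field names, and the 12-day limit are invented for illustration.

```python
# Hypothetical supplier options; every field and number here is illustrative.
options = [
    {"name": "A", "cost": 100, "days": 10, "passed_compliance": True},
    {"name": "B", "cost": 80, "days": 14, "passed_compliance": False},
    {"name": "C", "cost": 120, "days": 5, "passed_compliance": True},
]

# Only the real objectives get weights. Values are not normalized in this sketch.
weights = {"cost": 0.6, "days": 0.4}

def is_allowed(option: dict) -> bool:
    """Hard limits: an option passes or is removed. Nothing is traded off here."""
    return option["passed_compliance"] and option["days"] <= 12

def weighted_cost(option: dict) -> float:
    """Weighted sum of the objectives; lower is better."""
    return sum(w * option[k] for k, w in weights.items())

allowed = [o for o in options if is_allowed(o)]
best = min(allowed, key=weighted_cost)
print(best["name"])  # prints A
```

Supplier B has the lowest weighted cost on paper, but it never reaches the scoring step. A constraint expressed as a weight could have been outvoted by a low enough price.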

What if my objectives are not measurable?

Then you do not have objectives yet; you have intentions. Before adopting any multi-objective approach, instrument the outcomes. A rough numeric proxy that is logged consistently beats a precise definition that is never measured. You can refine the metric later; you cannot refine a blank.

The business problem under the math

Multi-objective multi-agent reinforcement learning (MOMARL) is a long name for a familiar situation. You have several agents doing work together. They are graded on more than one number. Those numbers pull in different directions.

A concrete example. A procurement workflow has three agents: a sourcing agent that finds suppliers, a negotiation agent that talks to them, and a compliance agent that checks contracts. The business cares about price, delivery time, and risk. Each agent has its own view and its own controls. None of them sees the whole picture, and the three objectives trade off against each other.

If you train, or prompt, each agent to maximize its own score, you get three confident specialists producing an incoherent outcome. The sourcing agent picks the cheapest supplier. The negotiation agent locks in the fastest delivery, ignoring price. The compliance agent rejects both because the supplier failed a check that the other two never read. The team produces a worse decision than one mediocre human would.

The research question is simple to state and hard to solve: how do you get a team of agents to agree on a tradeoff, in advance and at runtime, when each agent only sees a slice of the world?

[Figure: Three agents pulling on overlapping objectives]

What "coordinated preference" actually means

The paper's contribution is a way to learn a shared preference over objectives, coordinated across agents, instead of bolting weights on at the end. In plain English: the agents are trained to share an understanding of which tradeoffs the business prefers in which situations, not just which actions are locally good.

A few terms, defined once:

Objective: a number the business cares about (revenue, latency, error rate, customer effort).
Preference: a weighting that says "in this scenario, objective A matters twice as much as objective B."
Pareto front: the set of outcomes where you cannot improve one objective without hurting another. Operators usually call this "the menu of acceptable tradeoffs."
Scalarization: collapsing multiple objectives into one score using preferences. This is the step that goes wrong when preferences are not coordinated.
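To make "Pareto front" concrete, here is a small sketch that filters a set of outcomes down to the ones no other outcome beats on every objective. The supplier names and numbers are invented, and all three objectives are treated as lower-is-better.

```python
# Illustrative outcomes as (cost, delivery_days, risk); lower is better on all three.
outcomes = {
    "supplier_a": (100, 10, 0.2),
    "supplier_b": (80, 14, 0.4),
    "supplier_c": (120, 5, 0.1),
    "supplier_d": (110, 12, 0.3),  # worse than supplier_a on every objective
}

def dominates(x: tuple, y: tuple) -> bool:
    """x dominates y if it is no worse everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))

def pareto_front(points: dict) -> list:
    """Names of the outcomes that no other outcome dominates, sorted by name."""
    front = []
    for name, p in points.items():
        dominated = any(
            dominates(q, p) for other, q in points.items() if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

front = pareto_front(outcomes)
print(front)  # supplier_d is dropped; the other three are the menu
```

The three survivors are the "menu": each is the best choice under some weighting, and choosing among them is a preference decision, not a quality decision.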

The standard approach is to pick a preference vector, scalarize, and train. That gives you one point on the Pareto front. Change the business priority, retrain. Coordinated preference learning trains the team to generalize across preferences, so at runtime you can hand the system a different weighting (high urgency today, cost-sensitive tomorrow) and get sensible behavior without retraining.

For an operator, that is the part to remember. The output is not a single policy; it is a family of policies indexed by your current business priorities.
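A toy illustration of "indexed by your current business priorities": the same candidates, scored under different weights, produce different choices with nothing retrained. This is not the paper's training method. A lookup over fixed, invented candidates stands in for a learned policy.

```python
# Each value is a penalty in 0..1 (0 is best), so "speed" here means how slow
# the option is. All candidates and numbers are illustrative.
candidates = {
    "cheap_slow": {"cost": 0.1, "speed": 0.9, "risk": 0.5},
    "fast_pricey": {"cost": 0.8, "speed": 0.1, "risk": 0.4},
    "safe_middle": {"cost": 0.5, "speed": 0.5, "risk": 0.1},
}

def scalarize(outcome: dict, prefs: dict) -> float:
    """Collapse one outcome into a single penalty using the preference weights."""
    return sum(prefs[k] * outcome[k] for k in prefs)

def choose(prefs: dict) -> str:
    """Pick the candidate with the lowest weighted penalty under these preferences."""
    return min(candidates, key=lambda name: scalarize(candidates[name], prefs))

cost_sensitive = {"cost": 0.6, "speed": 0.2, "risk": 0.2}
urgent = {"cost": 0.2, "speed": 0.6, "risk": 0.2}
cautious = {"cost": 0.2, "speed": 0.2, "risk": 0.6}

print(choose(cost_sensitive))  # cheap_slow
print(choose(urgent))          # fast_pricey
print(choose(cautious))        # safe_middle
```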

A comparison: three ways to handle multi-objective agent teams

Most teams I talk to are doing one of three things, often without naming it. The third row is what the paper enables.

Approach	What the team does	When it breaks	Operator cost
Single objective, hope for the best	Pick the most important number; train or prompt agents for it	When the ignored objectives degrade enough to draw attention (usually compliance or cost)	Surprise incidents; rework; trust loss
Fixed weighted sum	Choose weights once, scalarize, train one policy	When business priorities shift (peak season, incident, new regulation)	Retraining cycles; stale behavior; shadow policies in prompts
Coordinated preference	Train across a distribution of preferences; pass current weights at runtime	When you cannot articulate any preferences at all, or objectives are not measurable	Up-front work to define and instrument objectives

The third row is more work to set up and less work to run. The first two rows are the reverse, which is why most teams end up there and then spend the savings on incidents.

How the coordination actually happens

You can skip this section if you only care about the operator decision. If you are going to brief an engineering lead, it helps to know the shape of the mechanism.

Each agent learns a policy conditioned on a preference vector. During training, preferences are sampled from a distribution, so the agent sees many tradeoff regimes. A coordination signal, shared across agents, keeps the agents' interpretations of the same preference vector aligned. Without that signal, agent A might read "weight 0.7 on cost" as "be cheap" while agent B reads it as "be fast and let someone else worry about cost."

flowchart LR
 P[Preference vector<br/>cost, speed, risk] --> A1[Agent 1<br/>sourcing]
 P --> A2[Agent 2<br/>negotiation]
 P --> A3[Agent 3<br/>compliance]
 A1 --> J[Joint action]
 A2 --> J
 A3 --> J
 J --> R[Joint reward<br/>scalarized by P]
 R -.coordination signal.-> A1
 R -.coordination signal.-> A2
 R -.coordination signal.-> A3

The dotted lines are the part that is new. Each agent gets feedback not on its own score under its own reading of the preferences, but on the team's score under the shared preferences. That is what gets the three agents to stop arguing.
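The two ingredients described above, sampled preferences and one team-level score fed back to every agent, can be sketched as follows. This is the simplest reading of the mechanism, not the paper's algorithm: the flat sampling distribution, the placeholder outcome, and the function names are all assumptions.

```python
import random

OBJECTIVES = ("cost", "speed", "risk")

def sample_preference(rng: random.Random) -> dict:
    """Draw a random weighting that sums to 1.

    Normalizing independent gamma(1, 1) draws gives a flat Dirichlet sample,
    so every tradeoff regime is equally likely to appear during training.
    """
    draws = [rng.gammavariate(1.0, 1.0) for _ in OBJECTIVES]
    total = sum(draws)
    return {name: d / total for name, d in zip(OBJECTIVES, draws)}

def team_reward(joint_outcome: dict, prefs: dict) -> float:
    """One score for the whole team: objective values weighted by the shared preference."""
    return sum(prefs[k] * joint_outcome[k] for k in OBJECTIVES)

rng = random.Random(0)
prefs = sample_preference(rng)

# Placeholder joint outcome (0..1, higher is better). A real one would come
# from the environment after all three agents have acted.
joint_outcome = {"cost": 0.4, "speed": 0.7, "risk": 0.9}

shared = team_reward(joint_outcome, prefs)

# Every agent receives the same number, not a private score.
feedback = {agent: shared for agent in ("sourcing", "negotiation", "compliance")}
```

Because the feedback is identical for all three agents, none of them can improve its own number at the team's expense.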

In a language-model agent setup, you do not need reinforcement learning to apply the idea. You need three things: a written preference vector that all agents read on every turn, a joint evaluation that scores the team output (not individual outputs), and a coordinator that updates the preference vector when the business situation changes.

A minimal operator setup

Here is a small, runnable sketch of what this looks like for a non-reinforcement-learning agent team. It is in Python, but the shape is what matters: one preference object, all agents read it, one joint evaluator.

# Defines what the business currently cares about; the weights sum to 1.
preferences = {
    "cost": 0.5,
    "speed": 0.3,
    "risk": 0.2,
}

# Each agent receives the same preference vector in its system prompt.
def agent_prompt(role: str, prefs: dict) -> str:
    weights = ", ".join(f"{k}={v}" for k, v in prefs.items())
    return (
        f"You are the {role} agent. Current business priorities: {weights}. "
        "Make choices that the team, scored jointly, would prefer under these weights."
    )

That snippet is doing one thing for the reader: making every agent answerable to the same scoreboard, and making that scoreboard adjustable without changing the agents.
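The snippet above covers the prompt side. The joint evaluator it refers to might look like the sketch below, assuming each objective has already been measured on a 0 to 1 scale where higher is better. The function name and the example outcome are hypothetical.

```python
preferences = {"cost": 0.5, "speed": 0.3, "risk": 0.2}

def joint_score(team_outcome: dict, prefs: dict) -> float:
    """Score the team's final output under the current weights (higher is better).

    team_outcome maps each objective to a 0..1 measurement of the finished
    workflow, not of any single agent's contribution.
    """
    missing = set(prefs) - set(team_outcome)
    if missing:
        raise ValueError(f"outcome is missing objectives: {sorted(missing)}")
    return sum(prefs[k] * team_outcome[k] for k in prefs)

# Invented measurements for one finished procurement run.
outcome = {"cost": 0.8, "speed": 0.5, "risk": 0.9}
score = joint_score(outcome, preferences)
```

Refusing to score an outcome with a missing objective is deliberate: an unmeasured objective should fail loudly, not count as zero.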

You can drive the preference vector from anywhere: a config file, a feature flag service, an incident response runbook. The point is that "what we care about right now" is data, not code, and not buried in prompts.

# preferences.yaml, loaded at the start of each workflow run
default:
 cost: 0.5
 speed: 0.3
 risk: 0.2
 
peak_season:
 cost: 0.2
 speed: 0.6
 risk: 0.2
 
post_incident:
 cost: 0.2
 speed: 0.2
 risk: 0.6

When your operations lead declares peak season, you flip a profile. The agents do not get retrained, re-prompted by hand, or argued with. They read the new weights on the next run.
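Flipping a profile is safer with a small guard in front of it. The sketch below mirrors the profiles from preferences.yaml as a plain dict so it runs without dependencies; with PyYAML installed, yaml.safe_load on the file should yield the same structure. The function name and validation rules are assumptions.

```python
# Same profiles as preferences.yaml, written inline for a dependency-free sketch.
PROFILES = {
    "default": {"cost": 0.5, "speed": 0.3, "risk": 0.2},
    "peak_season": {"cost": 0.2, "speed": 0.6, "risk": 0.2},
    "post_incident": {"cost": 0.2, "speed": 0.2, "risk": 0.6},
}

def get_preferences(profile: str, profiles: dict = PROFILES) -> dict:
    """Return a copy of the named profile after basic sanity checks."""
    if profile not in profiles:
        raise KeyError(f"unknown profile: {profile}")
    prefs = profiles[profile]
    if any(w < 0 for w in prefs.values()):
        raise ValueError(f"profile {profile} has a negative weight")
    if abs(sum(prefs.values()) - 1.0) > 1e-9:
        raise ValueError(f"profile {profile} weights do not sum to 1")
    return dict(prefs)

# Operations declares peak season: one string changes, the agents do not.
current = get_preferences("peak_season")
```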

[Figure: Preference profiles flowing into a multi-agent workflow]

What evaluation looks like when objectives compete

The single biggest mistake teams make here is keeping per-agent dashboards and calling it evaluation. Per-agent scores are useful for debugging. They are misleading for decisions.

You need three layers:

Per-agent metrics: did each agent produce something well-formed? This catches breakage.
Joint scalarized score: under the current preferences, how good was the team output? This is the number you report to the business.
Pareto coverage: across the preferences you actually use, how much of the achievable tradeoff space does your team cover? This is the number that tells you whether the system is flexible or brittle.

A team that scores well on the first layer and badly on the second is a team that is busy and useless. A team that scores well on the second but badly on the third is a team that works today and will fail the first time the business priorities shift. Eval-driven operations means you watch all three.

A note on Pareto coverage

You do not need to compute a true Pareto front to use this idea. Run the workflow under, say, five preference profiles you would realistically use. Plot the joint outcomes. If they cluster, your team only knows one trick. If they spread out and each profile produces a sensible result for its weighting, your team is coordinated.
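Here is one rough way to run that check without plotting. It uses the three profiles from the earlier config instead of five, and the joint outcomes are invented. In practice they would come from running the workflow once per profile. Both measures are assumptions about what "cluster" and "sensible" could mean.

```python
from itertools import combinations
from math import dist

profiles = {
    "default": {"cost": 0.5, "speed": 0.3, "risk": 0.2},
    "peak_season": {"cost": 0.2, "speed": 0.6, "risk": 0.2},
    "post_incident": {"cost": 0.2, "speed": 0.2, "risk": 0.6},
}

# Invented joint outcomes per profile (0..1, higher is better).
outcomes = {
    "default": {"cost": 0.8, "speed": 0.6, "risk": 0.6},
    "peak_season": {"cost": 0.5, "speed": 0.9, "risk": 0.6},
    "post_incident": {"cost": 0.5, "speed": 0.5, "risk": 0.95},
}

KEYS = ("cost", "speed", "risk")

def score(outcome: dict, prefs: dict) -> float:
    """Joint scalarized score of one outcome under one preference profile."""
    return sum(prefs[k] * outcome[k] for k in KEYS)

def min_spread(results: dict) -> float:
    """Smallest distance between any two outcomes; near zero means they cluster."""
    points = [tuple(o[k] for k in KEYS) for o in results.values()]
    return min(dist(a, b) for a, b in combinations(points, 2))

def fits_own_profile(name: str) -> bool:
    """True if the outcome produced under a profile scores best under that profile."""
    own = score(outcomes[name], profiles[name])
    return all(own >= score(other, profiles[name]) for other in outcomes.values())

spread = min_spread(outcomes)
sensible = {name: fits_own_profile(name) for name in profiles}
```

A team that only knows one trick shows up as a spread near zero. A team that spreads out but fails the second check is changing its behavior without tracking the weights.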

Where this fits in the agentic org

Most discussion of agent operating models focuses on autonomy and oversight: who can act, who must approve, who audits. Coordinated preference adds a third dimension: who decides what the team is optimizing for, and how that decision propagates.

In practice, the preference vector becomes a governance artifact. It is small, readable, and reviewable. Compliance can sign off on the risk weight floor. Finance can sign off on the cost weight under different conditions. Operations owns the runtime profile. The agents read the result.
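A sign-off like "the risk weight floor" can be enforced mechanically when a profile is proposed. This is a hypothetical sketch: the floor value, the owner named in the comment, and the function name are illustrative, not taken from the research.

```python
# Hypothetical governance limits; the owner and number are illustrative.
WEIGHT_FLOORS = {"risk": 0.15}  # e.g. signed off by compliance

def violations(prefs: dict, floors: dict = WEIGHT_FLOORS) -> list:
    """List every weight that falls below its approved floor."""
    return [
        f"{name} weight {prefs.get(name, 0.0)} is below floor {floor}"
        for name, floor in floors.items()
        if prefs.get(name, 0.0) < floor
    ]

approved = violations({"cost": 0.5, "speed": 0.3, "risk": 0.2})
rejected = violations({"cost": 0.6, "speed": 0.3, "risk": 0.1})
```

Running this in the same review step that records the weight change keeps the floor and the version history in one place.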

This is a more honest version of what most companies do with prompts today. Today, the tradeoffs are scattered across ten system prompts, edited by whoever was on call. Tomorrow, they are one object, with version history.

A few practical questions to ask your team this quarter:

What are the two to four objectives our agent workflows are actually trading off?
Are those objectives instrumented well enough to score outcomes after the fact?
Who owns each weight, and how is a weight change approved?
Under what conditions do we change profiles, and how fast can we do it?
When was the last time we measured joint outcomes, not per-agent outcomes?

If you cannot answer the first one, you are not ready for multi-agent workflows yet, no matter how good the models get. If you can answer the first one but not the others, the research direction in the paper is pointed at exactly the gap you have.

Coordinated Preference Learning for Multi-Agent AI Teams