Frequently asked questions

Do I need to retrain my model to use a better router?

No. The whole point of the manifold power iteration approach is that it is post-hoc. You run a calibration pass over a sample of representative inputs, recompute the router rows, and swap them in. The experts themselves are untouched. This is what makes it interesting for teams that did not pretrain their own model.

Is this only relevant to teams running open-weight MoE models?

Mostly yes for the literal technique, since you need access to the router weights. But the diagnostic mindset (watching expert utilization, routing entropy, and tail latency per expert) applies to any system where a dispatcher routes work to specialized workers. That includes agent orchestrators and human-in-the-loop triage queues.

How does this compare to load-balancing losses that are already used during MoE training?

What an MoE router actually does, in business terms

A Mixture-of-Experts (MoE) model is one large model made of many smaller sub-models, called experts. For each input token, only a few experts run. The rest stay idle. That is how a 400-billion-parameter model can answer at roughly the per-token compute cost of a 40-billion-parameter one.

The router is the part that decides which experts fire. Think of it as a dispatcher in a call center. Every incoming call gets matched to two or three specialists out of a hundred. If the dispatcher is good, calls land with the right specialist on the first try. If the dispatcher is lazy, the same three people answer everything and the other ninety-seven get paid to wait.

That dispatch decision has direct cost consequences:

Unused experts are still loaded in memory. You paid for the GPU memory.
Overused experts become latency bottlenecks. Your p99 latency, the slowest 1 percent of requests, gets ugly.
A poorly trained router silently degrades quality, and you only see it in eval drift weeks later.

Router as dispatcher in front of a pool of experts

Why current routers drift

In a standard MoE setup, the router is a small matrix. Each row of that matrix is supposed to represent one expert. When a token comes in, the router computes a similarity score between the token and each row, picks the top few, and sends the token to those experts.

The problem: nobody actually forces each row to be a faithful summary of what its expert does. The rows are trained jointly with everything else, and they tend to cluster. Several rows end up pointing in almost the same direction. That is the matrix-algebra version of three dispatchers all sending calls to the same specialist.
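
One way to check whether rows have clustered is to look at the cosine similarity between every pair of router rows. The sketch below is a diagnostic only, not something taken from the paper; the function name `row_redundancy` and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np

def row_redundancy(router_weights, threshold=0.9):
    """Return the cosine-similarity matrix of router rows and the pairs above threshold."""
    w = np.asarray(router_weights, dtype=float)
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    unit = w / np.clip(norms, 1e-12, None)   # scale each row to unit length
    sim = unit @ unit.T                      # sim[i, j] = cosine between rows i and j
    n = w.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n) if sim[i, j] > threshold]
    return sim, pairs

# Toy router: rows 0 and 1 point almost the same way, row 2 is distinct.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
sim, near_duplicates = row_redundancy(toy)
print(near_duplicates)  # prints [(0, 1)]
```

A long list of near-duplicate pairs suggests the router cannot tell those experts apart, which is the clustering described above.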

The manifold power iteration idea, without the math

The paper's proposal is to replace the trained router with one that is computed directly from each expert's behavior. The recipe, in plain English:

1. For each expert, collect the inputs it processes during a calibration pass.
2. Find the single direction that best summarizes those inputs. That direction becomes the expert's row in the router.
3. Constrain all those rows to live on a manifold (a curved surface where the rows are forced to stay distinct from each other).
4. Iterate until the rows stop moving.

The "power iteration" part is an old, well-understood numerical trick for finding the dominant direction of a dataset. The "manifold" part is the constraint that keeps the expert rows from collapsing onto each other. Together, they give you a router where each row genuinely encodes one expert's specialty, and no two rows are duplicates.
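
For readers who want to see the "power iteration" half concretely, here is a minimal sketch of finding the dominant direction of one expert's calibration inputs. It is not the paper's algorithm: it deliberately leaves out the manifold constraint, whose exact form is not given here, and the sign convention (pointing the direction toward the inputs) is an assumption added so that the result is usable as a routing row. The function names are hypothetical.

```python
import numpy as np

def dominant_direction(inputs, num_iters=100, tol=1e-8, seed=0):
    """Power iteration on X^T X: a unit vector that best summarizes the rows of X."""
    x = np.asarray(inputs, dtype=float)
    rng = np.random.default_rng(seed)
    v = rng.normal(size=x.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        new_v = x.T @ (x @ v)          # apply X^T X without forming the matrix
        norm = np.linalg.norm(new_v)
        if norm == 0.0:                # degenerate case: all inputs are zero
            break
        new_v /= norm
        converged = np.linalg.norm(new_v - v) < tol
        v = new_v
        if converged:
            break
    # Assumption: orient the direction so the inputs project positively onto it.
    if (x @ v).sum() < 0:
        v = -v
    return v

def router_rows_from_calibration(inputs_per_expert):
    """One row per expert: the dominant direction of the inputs that expert processed."""
    return np.stack([dominant_direction(x) for x in inputs_per_expert])

# Toy calibration data: expert 0 sees inputs along the first axis, expert 1 along the second.
expert_inputs = [
    [[2.0, 0.1], [3.0, -0.1], [1.0, 0.0]],
    [[0.1, 2.0], [-0.1, 3.0], [0.0, 1.0]],
]
rows = router_rows_from_calibration(expert_inputs)
```

In this toy case the two rows come out close to the two coordinate axes. Keeping rows distinct when experts see overlapping inputs is the job of the manifold constraint, which this sketch does not attempt.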

The operator translation: you get a dispatcher who has actually shadowed every specialist for a day, and who is forbidden from confusing any two of them.

Figure: router rows for experts e1 to e6. Left: standard router rows (clustered, redundant). Right: manifold-constrained rows (spread, distinct).

What this buys you, by the numbers the paper reports

The paper evaluates on standard MoE benchmarks. Headline observations:

Expert utilization becomes more even. The gap between the busiest and idlest expert shrinks.
Downstream task accuracy holds or improves slightly versus the trained router baseline.
The new router can be computed post-hoc, without retraining the experts themselves.

That last point is the one that matters for an operator. You do not need to retrain the whole model. You compute a better router on top of an existing checkpoint.

Comparison: three router strategies side by side

Router strategy	Training cost	Expert balance	When to use
Standard learned router	Trained jointly with experts; no extra step	Often skewed, hot experts emerge	Default in most open MoE checkpoints
Auxiliary load-balancing loss	Adds a balancing penalty during training	Better balance, can hurt accuracy	When you control pretraining
Manifold power iteration (this paper)	One calibration pass, no retraining	Strong balance, accuracy preserved	When you inherited a model and want to fix routing now

The third row describes the situation most teams are actually in. You did not pretrain the model. You are running someone else's checkpoint. You want better economics without a six-figure training run.

What this looks like in a serving stack

You do not have to implement the math yourself to benefit from this line of research. But it helps to know where in your stack a router lives, so you can ask the right questions of your inference vendor or your platform team.

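# Minimal sketch: how a router sits inside an MoE layer.
# This is the dispatch step that decides which experts run for each token.

import torch
import torch.nn.functional as F

def moe_forward(tokens, router_weights, experts, top_k=2):
    # tokens: [batch, hidden_dim]
    # router_weights: [num_experts, hidden_dim], one row per expert
    # experts: list of callables, each mapping [n, hidden_dim] -> [n, hidden_dim]
    scores = tokens @ router_weights.T                # similarity to each expert
    top_scores, top_idx = scores.topk(top_k, dim=-1)  # pick top-k experts per token
    gates = F.softmax(top_scores, dim=-1)             # how much weight per expert

    output = torch.zeros_like(tokens)
    for k in range(top_k):
        for e in range(len(experts)):
            # The original listing is cut off at this loop. The body below is the
            # usual dispatch-and-combine step, filled in as an assumption.
            mask = top_idx[:, k] == e                 # tokens whose k-th choice is expert e
            if mask.any():
                output[mask] += gates[mask, k].unsqueeze(-1) * experts[e](tokens[mask])
    return output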

The only thing the paper changes is router_weights. Everything else stays. That is why this kind of work is interesting to an operator: the surface area of the change is small, the blast radius is contained, and the benefit shows up in metrics you already track.

The metrics you should ask your platform team for

If you serve an MoE model, ask for these numbers, weekly:

Per-expert activation rate. The histogram should be flattish, not a long tail.
Tokens-per-second per expert. Hot experts are your latency tail.
Routing entropy. Higher means the router is using more of the pool.
Cost per million tokens, broken down by which experts fired.

# Example observability config for an MoE serving deployment.
# Each metric below maps to a business question your COO will ask.

metrics:
  - name: expert_activation_rate
    question: "Are we paying for experts that never run?"
    alert_if: "min_rate < 0.01 for 24h"

  - name: expert_tokens_per_second
    question: "Which expert is our latency bottleneck?"
    alert_if: "max / median > 5"

  - name: routing_entropy
    question: "Is the router collapsing onto a few experts?"
    alert_if: "entropy < 0.7 * log(num_experts)"

  - name: cost_per_million_tokens
    question: "Did the last router update change unit economics?"
    track: "weekly, by model version"
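
The activation-rate and entropy checks above can be computed from a plain log of which expert each token was routed to. The sketch below assumes such a log exists as a list of expert indices, which is a hypothetical format. The function name `routing_health` is also an assumption, and the "for 24h" window is left to whatever alerting system you use.

```python
import math
from collections import Counter

def routing_health(expert_ids, num_experts, min_rate=0.01, entropy_floor=0.7):
    """Compute activation rates and routing entropy from a log of routed expert IDs."""
    if not expert_ids:
        raise ValueError("expert_ids must contain at least one routing decision")
    counts = Counter(expert_ids)
    total = len(expert_ids)
    rates = [counts.get(e, 0) / total for e in range(num_experts)]
    # Shannon entropy in nats; log(num_experts) is the maximum, reached by even routing.
    entropy = -sum(p * math.log(p) for p in rates if p > 0)
    return {
        "activation_rate": rates,
        "routing_entropy": entropy,
        "starved_experts": [e for e, p in enumerate(rates) if p < min_rate],
        "entropy_alert": entropy < entropy_floor * math.log(num_experts),
    }

# Toy log: expert 0 takes 90 of 100 assignments, expert 3 gets none.
log = [0] * 90 + [1] * 6 + [2] * 4
report = routing_health(log, num_experts=4)
print(report["starved_experts"], report["entropy_alert"])
```

On the toy log, expert 3 is flagged as starved and the entropy alert fires, because the routing entropy falls well below 70 percent of its maximum.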

Dashboard view of expert activation, latency, and entropy

Why this matters beyond the model layer

This is a blog about agentic organizations, not about kernel-level MoE plumbing. So why spend a post on a router paper?

Because the same failure mode shows up one level up. Replace "experts" with "agents" or "vendors" or "teams." Replace "router" with "orchestrator" or "dispatcher" or "triage layer." The pattern is identical.

flowchart LR
 A[Incoming work] --> R{Router / orchestrator}
 R -->|score| E1[Agent A: refunds]
 R -->|score| E2[Agent B: billing]
 R -->|score| E3[Agent C: technical]
 R -->|score| E4[Agent D: escalations]
 R -->|score| E5[Agent E: general]
 
 E1 --> O[Resolved ticket]
 E2 --> O
 E3 --> O
 E4 --> O
 E5 --> O
 
 style R fill:#ddd,stroke:#333

In an agent operating model, the orchestrator decides which specialist agent handles a piece of work. If the orchestrator is sloppy, you get the same pathology the paper describes: one or two agents handle everything, the rest are dead weight, and you cannot tell whether your other agents are bad or just starved.

The three operator questions

Whether you are looking at MoE routing or agent dispatch, the same three questions decide whether the system is healthy:

1. Does every specialist get enough work to justify keeping it warm?
2. Is the dispatcher's idea of each specialist actually based on what that specialist does, or on a stale assumption from setup day?
3. When you add a new specialist, does the dispatcher route to it, or quietly ignore it?

The manifold power iteration method is, in essence, a clean answer to question 2 at the model layer. The lesson generalizes: dispatchers need a periodic recalibration against the actual behavior of the workers they route to. In agent systems, that means re-scoring agents against recent traces, not against the prompt you wrote them six months ago.
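
At the agent layer, re-scoring against recent traces can be as simple as rebuilding each agent's profile from observed outcomes. The following is a minimal sketch under stated assumptions: the trace format `(agent, category, succeeded)`, the `min_samples` cutoff, and both function names are hypothetical, and a real orchestrator would likely score on richer signals than a success flag.

```python
from collections import defaultdict

def recalibrate(traces, min_samples=5):
    """Build per-category success rates for each agent from recent traces."""
    stats = defaultdict(lambda: [0, 0])      # (agent, category) -> [successes, attempts]
    for agent, category, succeeded in traces:
        entry = stats[(agent, category)]
        entry[0] += int(succeeded)
        entry[1] += 1
    profile = defaultdict(dict)
    for (agent, category), (wins, attempts) in stats.items():
        if attempts >= min_samples:          # skip pairs with too little evidence
            profile[category][agent] = wins / attempts
    return dict(profile)

def route(category, profile, default_agent):
    """Send work to the agent with the best observed success rate for this category."""
    candidates = profile.get(category)
    if not candidates:
        return default_agent
    return max(candidates, key=candidates.get)

# Toy traces: agent A resolves 8 of 10 refund tasks, agent B resolves 5 of 10.
traces = (
    [("A", "refunds", True)] * 8 + [("A", "refunds", False)] * 2
    + [("B", "refunds", True)] * 5 + [("B", "refunds", False)] * 5
    + [("C", "billing", True)] * 2
)
profile = recalibrate(traces)
print(route("refunds", profile, default_agent="E"))  # prints A
```

Note that an agent with fewer than `min_samples` traces never enters the profile, so a newly added agent needs some deliberate traffic before a scheme like this can route to it.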

What to do this quarter

A short list, ordered by effort:

If you serve an MoE model, ask your inference vendor whether they expose per-expert utilization metrics. If they cannot, that is a signal about how much they understand the workload.
Add routing entropy and per-expert activation to your weekly review. Treat a collapsing distribution the same way you would treat a single support agent handling 80 percent of tickets.
If you run an agent system, write down what each agent is supposed to be good at, and audit the last 1,000 routed tasks against that description. The router-row-versus-expert mismatch is the same bug in a different layer.
Watch for follow-up work on post-hoc router calibration. The economics of fixing the router without retraining the model are very favorable for operators who run inherited checkpoints.
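
The agent audit in the list above can start as a small script. This sketch assumes you can export routed tasks as `(agent, category)` pairs and that you have written each agent's intended categories down as a set. Both formats and the function name `audit_routing` are hypothetical.

```python
from collections import Counter

def audit_routing(declared, routed_tasks):
    """Compare where tasks went against what each agent is declared to handle.

    declared: dict mapping agent name -> set of task categories it should handle.
    routed_tasks: iterable of (agent, category) pairs, e.g. the last 1,000 routed tasks.
    """
    handled = Counter()
    mismatched = Counter()
    for agent, category in routed_tasks:
        handled[agent] += 1
        if category not in declared.get(agent, set()):
            mismatched[agent] += 1
    total = sum(handled.values())
    return {
        agent: {
            "share_of_tasks": handled[agent] / total if total else 0.0,
            "mismatch_rate": mismatched[agent] / handled[agent] if handled[agent] else 0.0,
        }
        for agent in declared
    }

# Toy data: agent A takes most of the work, including billing tasks meant for B.
declared = {"A": {"refunds"}, "B": {"billing"}, "C": {"technical"}}
tasks = [("A", "refunds")] * 6 + [("A", "billing")] * 2 + [("B", "billing")] * 2
audit = audit_routing(declared, tasks)
```

A high share of tasks for one agent points at collapse, a high mismatch rate points at a stale description, and a share of zero (agent C here) points at a starved specialist.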

None of this requires you to read the math. It requires you to treat the dispatcher as a first-class component, not as glue.

Redesign MoE Routers with Manifold Power Iteration