
Manifold power iteration recomputes Mixture-of-Experts router rows to reflect actual expert behavior, reducing dead experts, memory waste, and p99 latency.
No. The whole point of the manifold power iteration approach is that it is post-hoc. You run a calibration pass over a sample of representative inputs, recompute the router rows, and swap them in. The experts themselves are untouched. This is what makes it interesting for teams that did not pretrain their own model.
Mostly yes for the literal technique, since you need access to the router weights. But the diagnostic mindset, watching expert utilization, routing entropy, and tail latency per expert, applies to any system where a dispatcher routes work to specialized workers. That includes agent orchestrators and human-in-the-loop triage queues.

Load-balancing losses penalize the model during training if expert usage is too uneven. They help, but they can trade off accuracy for balance, and they require you to be doing the training. The manifold approach reframes the problem: instead of penalizing imbalance, it constructs router rows directly from expert behavior, after training, with a constraint that keeps them distinct.
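To make the load-balancing idea concrete, here is a sketch of one common form of the auxiliary loss, in the style used by Switch Transformer: the number of experts times the sum, over experts, of the fraction of tokens dispatched to an expert multiplied by the mean router probability for that expert. The text above does not say which balancing loss it has in mind, so treat this particular formula and every name below as an assumption.

```python
def load_balancing_loss(assignments, router_probs, num_experts):
    """Switch-Transformer-style auxiliary loss, without its scaling coefficient.

    assignments: chosen expert index for each token (top-1 routing).
    router_probs: per-token list of router probabilities over experts.
    Equals 1.0 when dispatch and probabilities are both uniform, and grows
    toward num_experts as routing collapses onto a single expert.
    """
    num_tokens = len(assignments)
    loss = 0.0
    for e in range(num_experts):
        # Fraction of tokens actually sent to expert e.
        dispatched = sum(1 for a in assignments if a == e) / num_tokens
        # Average probability the router assigned to expert e.
        mean_prob = sum(p[e] for p in router_probs) / num_tokens
        loss += dispatched * mean_prob
    return num_experts * loss


# Two toy cases with two experts and two tokens.
uniform_loss = load_balancing_loss([0, 1], [[0.5, 0.5], [0.5, 0.5]], num_experts=2)
collapsed_loss = load_balancing_loss([0, 0], [[1.0, 0.0], [1.0, 0.0]], num_experts=2)
```

The penalty is smallest when usage is even, which is exactly why it can pull against accuracy: the model is rewarded for spreading tokens out even when one expert is the better fit.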
The main risk is silent quality regression on tasks that were not in your calibration set. The mitigation is the same as any model change: shadow traffic, eval suite, gradual rollout. Treat the router as a versioned artifact with its own changelog, not as a hidden weight inside the checkpoint.
Below "have evals at all" and above "fine-tune the base model." Most teams have neither solid evals on their orchestrator nor a clear picture of which agents are actually being used. Fix those first. The router-recalibration mindset is the next layer up: once you can measure dispatch quality, you can improve it without rebuilding the workers underneath.
If you run an AI product, the cheapest performance gain is rarely a bigger model. It is usually a smarter dispatcher in front of the model you already have. A new paper, "Redesign Mixture-of-Experts Routers with Manifold Power Iteration," makes that case at the level of the model architecture. The same principle scales up to how you route work across agents, vendors, and humans.
A Mixture-of-Experts (MoE) model is one large model made of many smaller sub-models, called experts. For each input token, only a few experts run. The rest stay idle. That is how a 400-billion-parameter model can answer at roughly the per-token compute cost of a 40-billion-parameter one, although every expert still has to sit in memory.
The router is the part that decides which experts fire. Think of it as a dispatcher in a call center. Every incoming call gets matched to two or three specialists out of a hundred. If the dispatcher is good, calls land with the right specialist on the first try. If the dispatcher is lazy, the same three people answer everything and the other ninety-seven get paid to wait.
That dispatch decision has direct cost consequences:

In a standard MoE setup, the router is a small matrix. Each row of that matrix is supposed to represent one expert. When a token comes in, the router computes a similarity score between the token and each row, picks the top few, and sends the token to those experts.
The problem: nobody actually forces each row to be a faithful summary of what its expert does. The rows are trained jointly with everything else, and they tend to cluster. Several rows end up pointing in almost the same direction. That is the matrix-algebra version of three dispatchers all sending calls to the same specialist.
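One way to check whether router rows have clustered is to measure the pairwise cosine similarity between them. The sketch below is illustrative only: the helper names `cosine` and `redundant_pairs`, the toy rows, and the 0.95 cutoff are assumptions, not something the paper prescribes.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def redundant_pairs(router_rows, threshold=0.95):
    """Return (i, j, similarity) for row pairs pointing in nearly the same direction.

    The threshold is an arbitrary illustrative cutoff.
    """
    pairs = []
    for i, j in combinations(range(len(router_rows)), 2):
        sim = cosine(router_rows[i], router_rows[j])
        if sim >= threshold:
            pairs.append((i, j, sim))
    return pairs


# Toy router with three experts: rows 0 and 1 are nearly parallel.
rows = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 0.0, 1.0],
]
flagged = redundant_pairs(rows)
```

Here only the pair (0, 1) is flagged. On a real checkpoint, a long list of flagged pairs would be a hint that several experts are competing for the same tokens.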
The paper's proposal is to replace the trained router with one that is computed directly from each expert's behavior. The recipe, in plain English:
The "power iteration" part is an old, well-understood numerical method for finding the dominant direction of a matrix (its leading eigenvector). The "manifold" part is the constraint that keeps the expert rows from collapsing onto each other. Together, they give you a router where each row genuinely encodes one expert's specialty, and no two rows are duplicates.
The operator translation: you get a dispatcher who has actually shadowed every specialist for a day, and who is forbidden from confusing any two of them.
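For readers who have not met power iteration before, here is a minimal, generic sketch. It finds the dominant eigenvector of a small symmetric matrix by repeatedly multiplying and renormalizing. This is textbook power iteration, not the paper's manifold-constrained variant; the function name, starting vector, and toy matrix are made up for illustration.

```python
import math


def power_iteration(matrix, num_steps=100):
    """Approximate the dominant eigenvector of a square matrix.

    Repeatedly multiplies a vector by the matrix and rescales it to unit
    length. Assumes the matrix has a single eigenvalue of largest magnitude
    and that the starting vector is not orthogonal to its eigenvector.
    """
    n = len(matrix)
    vec = [1.0] + [0.0] * (n - 1)  # simple deterministic starting vector
    for _ in range(num_steps):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        vec = [x / norm for x in product]
    return vec


# Toy symmetric matrix with eigenvalues 3 and 1. The dominant eigenvector
# is [1, 1] / sqrt(2); the other one is [1, -1] / sqrt(2).
toy = [[2.0, 1.0],
       [1.0, 2.0]]
direction = power_iteration(toy)
```

Each multiplication stretches the vector most along the dominant direction, so after enough steps everything else has been washed out. That is the whole trick.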
[Figure: standard router rows (clustered, redundant) compared with manifold-constrained rows (spread, distinct), for experts e1 through e6.]
The paper evaluates on standard MoE benchmarks. Headline observations:
That last point is the one that matters for an operator. You do not need to retrain the whole model. You compute a better router on top of an existing checkpoint.
| Router strategy | Training cost | Expert balance | When to use |
|---|---|---|---|
| Standard learned router | Trained jointly with experts; no extra step | Often skewed, hot experts emerge | Default in most open MoE checkpoints |
| Auxiliary load-balancing loss | Adds a balancing penalty during training | Better balance, can hurt accuracy | When you control pretraining |
| Manifold power iteration (this paper) | One calibration pass, no retraining | Strong balance, accuracy preserved | When you inherited a model and want to fix routing now |
The third row describes the situation most teams are actually in. You did not pretrain the model. You are running someone else's checkpoint. You want better economics without a six-figure training run.
You do not have to implement the math yourself to benefit from this line of research. But it helps to know where in your stack a router lives, so you can ask the right questions of your inference vendor or your platform team.
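# Minimal sketch: how a router sits inside an MoE layer.
# This is the dispatch step that decides which experts run for each token.
import torch
import torch.nn.functional as F


def moe_forward(tokens, router_weights, experts, top_k=2):
    # tokens: [batch, hidden_dim]
    # router_weights: [num_experts, hidden_dim], one row per expert
    # experts: list of callables mapping [n, hidden_dim] -> [n, hidden_dim]
    scores = tokens @ router_weights.T                 # similarity to each expert
    top_scores, top_idx = scores.topk(top_k, dim=-1)   # pick top-k experts per token
    gates = F.softmax(top_scores, dim=-1)              # how much weight per expert
    output = torch.zeros_like(tokens)
    for k in range(top_k):
        for e in range(len(experts)):
            # Assumed completion of the truncated loop: run expert e on the
            # tokens whose k-th choice is e, then add its gate-weighted output.
            mask = top_idx[:, k] == e
            if mask.any():
                output[mask] += gates[mask, k].unsqueeze(-1) * experts[e](tokens[mask])
    return output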
The only thing the paper changes is router_weights. Everything else stays. That is why this kind of work is interesting to an operator: the surface area of the change is small, the blast radius is contained, and the benefit shows up in metrics you already track.
If you serve an MoE model, ask for these numbers, weekly:
# Example observability config for an MoE serving deployment.
# Each metric below maps to a business question your COO will ask.
metrics:
  - name: expert_activation_rate
    question: "Are we paying for experts that never run?"
    alert_if: "min_rate < 0.01 for 24h"
  - name: expert_tokens_per_second
    question: "Which expert is our latency bottleneck?"
    alert_if: "max / median > 5"
  - name: routing_entropy
    question: "Is the router collapsing onto a few experts?"
    alert_if: "entropy < 0.7 * log(num_experts)"
  - name: cost_per_million_tokens
    question: "Did the last router update change unit economics?"
    track: "weekly, by model version"
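Two of these metrics can be computed from nothing more than a per-expert token count. The sketch below assumes you can export such counts from your serving stack; the function names and the sample counts are hypothetical, and the entropy alert reuses the 0.7 * log(num_experts) threshold from the config above.

```python
import math


def activation_rates(token_counts):
    """Fraction of routed tokens that each expert received."""
    total = sum(token_counts)
    return [count / total for count in token_counts]


def routing_entropy(token_counts):
    """Shannon entropy (in nats) of the expert usage distribution."""
    rates = activation_rates(token_counts)
    return -sum(p * math.log(p) for p in rates if p > 0)


def entropy_alert(token_counts, fraction=0.7):
    """True when entropy falls below fraction * log(num_experts)."""
    return routing_entropy(token_counts) < fraction * math.log(len(token_counts))


# Hypothetical token counts for four experts over one reporting window.
balanced = [250, 250, 250, 250]
collapsed = [970, 10, 10, 10]
```

With these numbers, `entropy_alert(balanced)` stays quiet and `entropy_alert(collapsed)` fires, because almost all traffic in the second window lands on a single expert.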
This is a blog about agentic organizations, not about kernel-level MoE plumbing. So why spend a post on a router paper?
Because the same failure mode shows up one level up. Replace "experts" with "agents" or "vendors" or "teams." Replace "router" with "orchestrator" or "dispatcher" or "triage layer." The pattern is identical.
flowchart LR
    A[Incoming work] --> R{Router / orchestrator}
    R -->|score| E1[Agent A: refunds]
    R -->|score| E2[Agent B: billing]
    R -->|score| E3[Agent C: technical]
    R -->|score| E4[Agent D: escalations]
    R -->|score| E5[Agent E: general]
    E1 --> O[Resolved ticket]
    E2 --> O
    E3 --> O
    E4 --> O
    E5 --> O
    style R fill:#ddd,stroke:#333
In an agent operating model, the orchestrator decides which specialist agent handles a piece of work. If the orchestrator is sloppy, you get the same pathology the paper describes: one or two agents handle everything, the rest are dead weight, and you cannot tell whether your other agents are bad or just starved.
Whether you are looking at MoE routing or agent dispatch, the same three questions decide whether the system is healthy:
The manifold power iteration method is, in essence, a clean answer to question 2 at the model layer. The lesson generalizes: dispatchers need a periodic recalibration against the actual behavior of the workers they route to. In agent systems, that means re-scoring agents against recent traces, not against the prompt you wrote them six months ago.
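As a rough illustration of re-scoring agents against recent traces, the sketch below tallies how often each agent was dispatched and how often it resolved the work. The trace format (`agent`, `resolved`), the function name, and the sample data are assumptions made for the example, not a real logging schema.

```python
from collections import defaultdict


def score_agents(traces):
    """Summarize per-agent dispatch share and resolution rate.

    traces: iterable of dicts with keys "agent" (str) and "resolved" (bool).
    Returns {agent: {"share": float, "resolution_rate": float}}.
    """
    handled = defaultdict(int)
    resolved = defaultdict(int)
    for trace in traces:
        handled[trace["agent"]] += 1
        if trace["resolved"]:
            resolved[trace["agent"]] += 1
    total = sum(handled.values())
    return {
        agent: {
            "share": count / total,
            "resolution_rate": resolved[agent] / count,
        }
        for agent, count in handled.items()
    }


# Hypothetical recent traces: "general" handles most work but resolves little.
recent = (
    [{"agent": "general", "resolved": False}] * 6
    + [{"agent": "general", "resolved": True}] * 2
    + [{"agent": "billing", "resolved": True}] * 2
)
scores = score_agents(recent)
```

An agent with a large share and a low resolution rate suggests the dispatcher is over-routing to it. An agent that never appears in the result received no work in the window, so its quality cannot be judged from these traces at all.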
A short list, ordered by effort:
None of this requires you to read the math. It requires you to treat the dispatcher as a first-class component, not as glue.