
Frequently asked questions

Do I need a research team to copy this pattern?

No. You need one applied ML engineer who has fine-tuned a VLM before, one data engineer to wire up the ground-truth source, and one subject matter expert who will write the golden evaluation set. The rest is configuration and review. A team of three can have a first version running in six to eight weeks.

How much does the synthesis step actually cost?

For a domain with roughly 10,000 source items (clips, documents, calls), generating three questions per item with a strong commercial VLM typically costs in the low thousands of dollars. The fine-tune run costs about the same. The evaluation harness is engineering time, not API spend.
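To sanity-check that estimate against your own archive, a back-of-envelope calculation is enough. The sketch below is illustrative only: the per-call price and the one-call-per-item batching are assumptions, not vendor figures, so substitute your own rates.

```python
# Back-of-envelope synthesis cost.
# The price and batching below are hypothetical placeholders.

def synthesis_cost(num_items: int, calls_per_item: int, usd_per_call: float) -> float:
    """Total API spend for one synthesis pass over the archive."""
    return num_items * calls_per_item * usd_per_call


if __name__ == "__main__":
    items = 10_000   # source items, as in the estimate above
    calls = 1        # assumption: one call returns all three questions
    price = 0.20     # assumption: USD per multimodal call (hypothetical)
    total = synthesis_cost(items, calls, price)
    print(f"{items * 3:,} Q/A pairs for about ${total:,.0f}")
```

If your vendor bills per token or per video second rather than per call, replace `usd_per_call` with an average you measure on a pilot batch of a few hundred items.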

What if I do not have structured ground truth next to my unstructured data?

Then your first project is not an agent; it is wiring your operational systems to your unstructured archive. This is unglamorous and almost always pays for itself even without the agent. The MSUE pattern depends on being able to verify synthetic answers against something you already know. Without that anchor, you are training on guesses.

What MSUE actually does

SoccerNet VQA asks a model to watch soccer footage and answer questions about it. The questions range from short factual ones ("which team has possession at minute 34") to long-form explanations ("why did the referee award a free kick here"). The challenge rewards systems that handle both styles and that ground their answers in the video, not in a memorized prior about famous matches.

MSUE is built on a vision-language model (VLM), which is a model that takes images or video plus text as input and produces text as output. Think of it as a general-purpose visual reader. Out of the box, a general VLM knows what a soccer ball looks like but does not know the laws of the game, the tactical vocabulary, or how to read a possession sequence. MSUE closes that gap with two pieces:

A data synthesis pipeline that turns raw soccer data (match logs, commentary, event tags) into question-answer training pairs.
A fine-tuning recipe that teaches the base VLM to produce both short, exact answers and longer, grounded explanations.

The interesting claim, from an operator point of view, is that the data synthesis step is "cost-effective." That means: instead of paying human annotators to write tens of thousands of questions about soccer clips, the team uses a strong VLM to generate those questions from data that already exists. The strong general model becomes the annotator. Humans review and filter rather than write from scratch.

Why this is an operator story, not a sports story

Most B2B operators do not care about SoccerNet rankings. But most B2B operators do have the same underlying problem MSUE solved: a stack of proprietary data, a workflow where humans currently watch or read that data and answer questions about it, and no budget to label hundreds of thousands of examples by hand.

The MSUE pattern generalizes cleanly:

Business domain	Raw data already collected	Questions humans answer today	Synthetic VQA equivalent
Auto insurance claims	Damage photos, adjuster notes	"Is this repairable or a total loss?"	Photo plus claim metadata, model is asked to classify and justify
Retail loss prevention	CCTV clips, POS logs	"Did this transaction match what happened at the till?"	Clip plus receipt, model is asked to flag mismatch and explain
Manufacturing QA	Line camera footage, defect logs	"Which station introduced this defect?"	Video plus log, model is asked to localize and explain
Field service	Technician body cam, work orders	"Was the install done to spec?"	Footage plus checklist, model is asked to score and cite frames

In every row, the raw data exists, the question style is narrow, and the cost of human review is the bottleneck. MSUE's contribution is a template for getting past that bottleneck without a heroic annotation budget.

What "cost-effective data synthesis" actually means

The headline phrase hides a specific trick. The team has structured data on each match: who scored, when, what type of event happened. They feed that structured data, plus the corresponding video clip, into a strong VLM and prompt it to produce questions and answers that a human evaluator might ask. Because the structured data is the ground truth, they can check the synthesized answers against it and throw out the bad ones.

This is the part operators should copy. If you have any kind of structured log next to your unstructured data (a case management system next to call recordings, a ticketing system next to chat transcripts, a defect database next to line video), you can do the same thing.

# Generate a training question from a structured event and its clip.
# In plain English: ask a strong VLM to write a question whose answer
# we already know from our own database, then keep only the ones it
# answers correctly when we test it back.

def synthesize_vqa(event, clip_path, vlm):
    """Return only the generated Q/A pairs that pass verification."""
    prompt = f"""
    You are writing training questions for a {event.domain} expert.
    Ground truth: {event.to_json()}
    Watch the clip and write 3 question/answer pairs:
    1. one short factual question
    2. one short numeric or categorical question
    3. one long-form 'why' question
    Each answer must be verifiable from the ground truth.
    """
    # vlm.generate is assumed to return an iterable of Q/A candidates.
    candidates = vlm.generate(prompt, video=clip_path)
    # verify() checks a candidate against the event's ground truth.
    return [qa for qa in candidates if verify(qa, event)]

The verify step is where most teams cut corners and regret it. If you do not check the synthesized answer against your structured source of truth, you are training a model on its own hallucinations. MSUE's reported gains come in large part from this filter.
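What a verifier looks like depends on your ground truth, but a minimal sketch helps show how little code the filter needs. Everything here is an assumption for illustration: the `Event` class, the shape of each candidate (a dict with hypothetical `field`, `answer`, and `style` keys), and the matching rules. It is not the filter the MSUE team used.

```python
from dataclasses import dataclass, field
import json
import re


@dataclass
class Event:
    """Hypothetical structured record, mirroring the `event` argument above."""
    domain: str
    fields: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(self.fields, sort_keys=True)


def _normalize(value) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", str(value).lower())
    return " ".join(text.split())


def verify(qa: dict, event: Event) -> bool:
    """Keep a pair only if its answer agrees with the ground truth.

    Assumes each candidate names the ground-truth field it is about
    (qa["field"]). Candidates that do not are rejected, not guessed at.
    """
    key = qa.get("field")
    if key not in event.fields:
        return False
    truth = _normalize(event.fields[key])
    answer = _normalize(qa.get("answer", ""))
    if not truth or not answer:
        return False
    if qa.get("style") == "long_form_why":
        # Long answers must at least state the known fact as whole words.
        return f" {truth} " in f" {answer} "
    # Short answers must match the ground truth exactly after normalizing.
    return answer == truth
```

The exact-match rule is deliberately strict. Numeric fields such as payouts usually need their own comparison with a tolerance, and long-form answers that pass this check are still good candidates for spot review by a human.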

The architecture, in operator terms

Here is the loop, as a diagram, with the parts an operator needs to staff and budget for labeled clearly.

flowchart LR
 A[Proprietary data:<br/>video, docs, logs] --> B[Structured ground truth:<br/>your existing systems]
 A --> C[Strong general VLM<br/>used as annotator]
 B --> C
 C --> D[Synthesized Q/A pairs]
 D --> E[Automated verifier<br/>checks vs ground truth]
 E -->|kept| F[Fine-tune base VLM<br/>into domain expert]
 E -->|rejected| G[Discard or send<br/>to human review]
 F --> H[Evaluation harness:<br/>real questions from ops]
 H -->|pass| I[Deploy as agent]
 H -->|fail| C

Read left to right, the cost profile is: data you already have (free), a few thousand dollars of VLM inference to generate Q/A pairs, an engineer-week to write the verifier, a fine-tune run (low four figures on a rented GPU for a model of this size), and the evaluation harness, which is the part most teams underinvest in.

The evaluation harness is the moat

If you only remember one thing from MSUE for your own org: the evaluation harness is more valuable than the model. The SoccerNet challenge ships with a fixed evaluation set, which is why the field can compare submissions. Inside a business, you have to build that yourself, and it is the asset that makes the difference between an agent you can trust in production and a demo that breaks the first time a real user asks a real question.

A useful evaluation harness for a domain VQA agent has three layers:

A frozen golden set of 200 to 500 real questions written by your subject matter experts, with reference answers. This never changes.
An automatic grader, often a stronger model, that scores new model outputs against the reference. You calibrate it once against human grades.
A small "canary" set of adversarial cases: questions designed to trip the model into guessing, hallucinating, or straying out of domain.
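The three layers above can be sketched in a few functions. This is a rough outline under stated assumptions: the model and grader are plain callables, each case is a dict with hypothetical `question` and `reference` keys, and the grader returns pass or fail rather than a score. A real grader would likely be a call to a stronger model.

```python
from typing import Callable

Model = Callable[[str], str]               # question -> answer
Grader = Callable[[str, str, str], bool]   # question, reference, answer -> pass?


def accuracy(model: Model, grader: Grader, cases: list[dict]) -> float:
    """Fraction of cases whose model answer the grader accepts."""
    if not cases:
        return 0.0
    passed = sum(
        grader(c["question"], c["reference"], model(c["question"]))
        for c in cases
    )
    return passed / len(cases)


def grader_agreement(grader: Grader, graded: list[dict]) -> float:
    """Share of human-graded samples where the automatic grader agrees."""
    if not graded:
        return 0.0
    agree = sum(
        grader(g["question"], g["reference"], g["answer"]) == g["human_pass"]
        for g in graded
    )
    return agree / len(graded)


def run_harness(model: Model, grader: Grader,
                golden: list[dict], canary: list[dict]) -> dict:
    """Score the frozen golden set and the adversarial canary set separately."""
    return {
        "golden_accuracy": accuracy(model, grader, golden),
        "canary_accuracy": accuracy(model, grader, canary),
    }
```

Keeping the two scores separate matters: a model can improve on the golden set while getting worse at refusing the canary questions, and an averaged number would hide that.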

A minimal operator setup

To make this concrete, here is a sketch of the configuration an ops team would run to reproduce the MSUE pattern in a non-sports domain. This is not the SoccerNet code; it is the operator-facing wrapper you would build around any base VLM.

# domain-vqa.yaml
# One config that drives synthesis, fine-tune, and eval.
# Owned by the ops lead, not the ML team.

domain: auto_claims
base_model: a-general-purpose-vlm
ground_truth_source:
  system: claims_db
  table: settled_claims
  fields: [claim_id, damage_type, severity, payout, repair_or_total]

synthesis:
  clips_per_claim: 3
  questions_per_clip: 3
  styles: [short_factual, categorical, long_form_why]
  verifier: structured_match
  keep_threshold: 0.85

fine_tune

The file is boring on purpose. The point is that the operator owns the knobs that matter (which data, which questions, which pass criteria) and the ML team owns the steps in between. That separation is what makes the program survivable when the base model changes next quarter.

To kick off a run from a laptop:

# Synthesize, fine-tune, and evaluate in one command.
# Safe to re-run: each step is cached by config hash.
 
agenthive run domain-vqa.yaml \
 --stage synthesize \
 --stage finetune \
 --stage evaluate \
 --report out/claims-vqa-report.html
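The "cached by config hash" comment is worth unpacking, because it is what makes re-runs cheap. How the CLI above implements it is not documented here, so the following is only one plausible sketch: hash the config together with the stage name, and skip the stage if a result for that hash already exists. The function names and the cache layout are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def stage_key(config: dict, stage: str) -> str:
    """Stable short hash of the config plus stage name."""
    # sort_keys makes the hash independent of key order in the config.
    payload = json.dumps({"stage": stage, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def run_cached(config: dict, stage: str, work, cache_dir: Path):
    """Run work(config) once per (config, stage) and reuse the stored result.

    The result must be JSON-serializable so it can be written to disk.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    marker = cache_dir / f"{stage}-{stage_key(config, stage)}.json"
    if marker.exists():
        return json.loads(marker.read_text())["result"]
    result = work(config)
    marker.write_text(json.dumps({"result": result}))
    return result
```

With this design, changing any knob in the config (say `keep_threshold`) produces a new hash and forces the stage to run again, while an unchanged config returns the stored result.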

The HTML report is what the COO actually reads. It shows accuracy on the golden set, failure modes on the canary set, and a per-question diff against the last run. If the numbers regress, you do not deploy.
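The "no deploy on regression" rule and the per-question diff are both simple to automate. The sketch below assumes each run is summarized as a dict of metric names to scores, plus a dict of question IDs to pass/fail; both shapes, and the function names, are hypothetical.

```python
def should_deploy(current: dict, previous: dict,
                  tolerance: float = 0.0) -> tuple[bool, list[str]]:
    """Block deployment if any tracked metric fell by more than `tolerance`."""
    regressions = [
        f"{name}: {previous[name]:.3f} -> {current.get(name, 0.0):.3f}"
        for name in previous
        # A metric missing from the current run counts as 0.0, so it blocks.
        if current.get(name, 0.0) < previous[name] - tolerance
    ]
    return (not regressions, regressions)


def per_question_diff(current: dict, previous: dict) -> dict:
    """List questions whose pass/fail status changed since the last run."""
    shared = current.keys() & previous.keys()
    return {
        "newly_failing": sorted(q for q in shared if previous[q] and not current[q]),
        "newly_passing": sorted(q for q in shared if current[q] and not previous[q]),
    }
```

A small nonzero `tolerance` can absorb grader noise, but it is a judgment call: set it from the spread you observe when you re-run the same model against the same golden set.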

What this means for your 2026 roadmap

Three takeaways for operators looking at vertical agents this year.

First, narrow beats general for production work. The MSUE result, like a growing body of recent vertical-agent work, suggests that a fine-tuned smaller model with a clean evaluation loop can outperform a giant general model on the specific questions your business needs answered. The general model is still useful as the annotator and the grader. It is not necessarily the thing you deploy.

Second, the bottleneck moved. Two years ago it was model capability. One year ago it was inference cost. Today, for any team with proprietary data, the bottleneck is the data pipeline and the evaluation harness. Budget accordingly. If your AI line item is 90 percent model API spend and 0 percent evaluation engineering, you are pointed at last year's problem.

Third, your defensible asset is the loop, not the model. The base VLM you fine-tune today will be replaced by a better one within a year. The structured ground truth, the synthesis pipeline, the golden set, and the canary set carry forward. They are what let you re-fine-tune on the next base model in a week instead of a quarter. Build them like infrastructure, because that is what they are.

MSUE: Build Narrow Domain Agents with Synthetic VQA Data