
How the MSUE soccer VQA system shows operators a repeatable pipeline for fine-tuning vision-language models on proprietary data without large annotation…
No. You need one applied ML engineer who has fine-tuned a VLM before, one data engineer to wire up the ground-truth source, and one subject matter expert who will write the golden evaluation set. The rest is configuration and review. A team of three can have a first version running in six to eight weeks.
For a domain with roughly 10,000 source items (clips, documents, calls), generating three questions per item with a strong commercial VLM typically costs in the low thousands of dollars. The fine-tune run costs about the same. The evaluation harness is engineering time, not API spend.
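As a back-of-envelope check on that figure, the arithmetic can be sketched as below. The per-request price is a placeholder assumption, not a quoted rate from any vendor, and the sketch assumes one VLM request per source item that returns all three Q/A pairs. Substitute your own provider's pricing.

```python
# Back-of-envelope synthesis cost.
# ASSUMPTION: every price here is a hypothetical placeholder.

def synthesis_cost(source_items: int, cost_per_request_usd: float,
                   retry_rate: float = 0.0) -> float:
    """Estimate spend when each source item needs one VLM request.

    retry_rate is the fraction of requests that must be re-sent
    (for example after a malformed response).
    """
    requests = source_items * (1 + retry_rate)
    return requests * cost_per_request_usd


items = 10_000
questions_per_item = 3
assumed_cost_per_request = 0.20  # ASSUMPTION: illustrative only

total = synthesis_cost(items, assumed_cost_per_request)
print(f"{items * questions_per_item:,} candidate Q/A pairs "
      f"for roughly ${total:,.0f}")
```

Video-heavy domains will sit at the upper end, since per-request cost grows with clip length.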
Then your first project is not an agent, it is wiring your operational systems to your unstructured archive. This is unglamorous and almost always pays for itself even without the agent. The MSUE pattern depends on being able to verify synthetic answers against something you already know. Without that anchor, you are training on guesses.

Yes, unless you train it to refuse. The canary set in the evaluation harness exists for exactly this reason. You measure refusal rate on out-of-domain questions and you do not deploy a model that confidently answers questions it should decline. This is governance you can actually enforce, not a policy document.
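The refusal check can be sketched in a few lines. This is a minimal illustration, not the harness from the paper: `CanaryResult`, `REFUSAL_MARKERS`, and the 0.95 threshold are all assumptions, and a real harness would likely use a grader model instead of phrase matching.

```python
from dataclasses import dataclass


@dataclass
class CanaryResult:
    question: str  # an out-of-domain question the model should decline
    answer: str    # what the model actually said


# ASSUMPTION: the fine-tuned model signals refusal with one of these phrases.
REFUSAL_MARKERS = (
    "i can't answer",
    "i cannot answer",
    "outside my domain",
)


def is_refusal(answer: str) -> bool:
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(results: list[CanaryResult]) -> float:
    """Fraction of canary questions the model declined to answer."""
    if not results:
        raise ValueError("canary set is empty")
    return sum(is_refusal(r.answer) for r in results) / len(results)


def safe_to_deploy(results: list[CanaryResult],
                   min_refusal_rate: float = 0.95) -> bool:
    # ASSUMPTION: 0.95 is a placeholder; set the bar per domain.
    return refusal_rate(results) >= min_refusal_rate


canary = [
    CanaryResult("What is the capital of France?",
                 "I cannot answer that; it is outside my domain."),
    CanaryResult("Who won the 1998 election?",
                 "The winner was candidate X."),
]
print(refusal_rate(canary), safe_to_deploy(canary))
```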
A domain VQA expert like MSUE is a single tool, not a whole agent. In a larger agent stack it sits behind a router: the agent decides a question is about, say, claims footage, and calls the claims expert; for legal questions it calls a different expert. The MSUE pattern is how you build each expert. The orchestration layer is a separate problem, and a simpler one once your experts are reliable.
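A minimal sketch of that router, assuming keyword matching stands in for whatever classifier the agent really uses. The expert functions, the `ROUTES` table, and the keywords are hypothetical; in practice each expert would wrap a fine-tuned model.

```python
from typing import Callable

Expert = Callable[[str], str]


def claims_expert(question: str) -> str:
    # Placeholder for a call to the fine-tuned claims VQA model.
    return f"[claims expert] {question}"


def legal_expert(question: str) -> str:
    # Placeholder for a call to a separate legal expert.
    return f"[legal expert] {question}"


# ASSUMPTION: simple keyword routing, for illustration only.
ROUTES: dict[str, tuple[tuple[str, ...], Expert]] = {
    "claims": (("claim", "damage", "footage"), claims_expert),
    "legal": (("contract", "liability", "legal"), legal_expert),
}


def route(question: str) -> str:
    """Send the question to the first expert whose keywords match."""
    text = question.lower()
    for keywords, expert in ROUTES.values():
        if any(keyword in text for keyword in keywords):
            return expert(question)
    # Declining is safer than letting the wrong expert guess.
    return "No expert available for this question."


print(route("Does the footage show rear bumper damage?"))
print(route("Is this contract clause enforceable?"))
```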
The MSUE (Multi-Modal Soccer Understanding Expert) submission to the 2026 SoccerNet Visual Question Answering challenge is, on the surface, a sports research paper. Read it as an operator and a different picture emerges: a repeatable recipe for building a domain-specific agent on top of a general vision-language model, using cheap synthetic data and a focused evaluation loop. That pattern, more than the soccer result, is what matters for teams considering vertical AI agents in 2026.
In this post I will walk through what MSUE actually did, translate the moving parts into operator terms, and lay out what it would take to copy the approach for a non-sports business: claims review, retail loss prevention, manufacturing line inspection, telehealth triage. The source paper is MSUE: Multi-Modal Soccer Understanding Expert.
SoccerNet VQA asks a model to watch soccer footage and answer questions about it. The questions range from short factual ones ("which team has possession at minute 34") to long-form explanations ("why did the referee award a free kick here"). The challenge rewards systems that handle both styles and that ground their answers in the video, not in a memorized prior about famous matches.
MSUE is built on a vision-language model (VLM), which is a model that takes images or video plus text as input and produces text as output. Think of it as a general-purpose visual reader. Out of the box, a general VLM knows what a soccer ball looks like but does not know the laws of the game, the tactical vocabulary, or how to read a possession sequence. MSUE closes that gap with two pieces:
The interesting claim, from an operator point of view, is that the data synthesis step is "cost-effective." That means: instead of paying human annotators to write tens of thousands of questions about soccer clips, the team uses a strong VLM to generate those questions from data that already exists. The base model becomes the annotator. Humans review and filter rather than write from scratch.

Most B2B operators do not care about SoccerNet rankings. But most B2B operators do have the same underlying problem MSUE solved: a stack of proprietary data, a workflow where humans currently watch or read that data and answer questions about it, and no budget to label hundreds of thousands of examples by hand.
The MSUE pattern generalizes cleanly:
| Business domain | Raw data already collected | Questions humans answer today | Synthetic VQA equivalent |
|---|---|---|---|
| Auto insurance claims | Damage photos, adjuster notes | "Is this repairable or a total loss?" | Photo plus claim metadata, model is asked to classify and justify |
| Retail loss prevention | CCTV clips, POS logs | "Did this transaction match what happened at the till?" | Clip plus receipt, model is asked to flag mismatch and explain |
| Manufacturing QA | Line camera footage, defect logs | "Which station introduced this defect?" | Video plus log, model is asked to localize and explain |
| Field service | Technician body cam, work orders | "Was the install done to spec?" | Footage plus checklist, model is asked to score and cite frames |
In every row, the raw data exists, the question style is narrow, and the cost of human review is the bottleneck. MSUE's contribution is a template for getting past that bottleneck without a heroic annotation budget.
The headline phrase hides a specific trick. The team has structured data on each match: who scored, when, what type of event happened. They feed that structured data, plus the corresponding video clip, into a strong VLM and prompt it to produce questions and answers that a human evaluator might ask. Because the structured data is the ground truth, they can check the synthesized answers against it and throw out the bad ones.
This is the part operators should copy. If you have any kind of structured log next to your unstructured data (a case management system next to call recordings, a ticketing system next to chat transcripts, a defect database next to line video), you can do the same thing.
```python
# Generate a training question from a structured event and its clip.
# In plain English: ask a strong VLM to write a question whose answer
# we already know from our own database, then keep only the ones it
# answers correctly when we test it back.
def synthesize_vqa(event, clip_path, vlm):
    prompt = f"""
    You are writing training questions for a {event.domain} expert.
    Ground truth: {event.to_json()}
    Watch the clip and write 3 question/answer pairs:
    1. one short factual question
    2. one short numeric or categorical question
    3. one long-form 'why' question
    Each answer must be verifiable from the ground truth.
    """
    candidates = vlm.generate(prompt, video=clip_path)
    return [qa for qa in candidates if verify(qa, event)]
```

The verify step is where most teams cut corners and regret it. If you do not check the synthesized answer against your structured source of truth, you are training a model on its own hallucinations. MSUE's reported gains come in large part from this filter.
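One way a `verify` function could look, as a hedged sketch. It assumes each candidate pair is a dict tagged with the ground-truth field it claims to answer (`"field"`), and that the event exposes its structured facts as a dict. Neither detail comes from the MSUE paper. Exact matching like this only covers short factual and categorical answers; long-form "why" answers would need a grader model or human review.

```python
from dataclasses import dataclass


@dataclass
class Event:
    domain: str
    facts: dict  # structured ground truth, e.g. {"scorer": "Player 9"}


def normalize(value) -> str:
    """Compare values case-insensitively and ignore surrounding spaces."""
    return str(value).strip().lower()


def verify(qa: dict, event: Event) -> bool:
    """Keep a pair only if its answer matches the field it claims to cover."""
    key = qa.get("field")
    if key not in event.facts:
        return False  # the answer cannot be checked, so discard it
    return normalize(qa.get("answer", "")) == normalize(event.facts[key])


event = Event("soccer", {"scorer": "Player 9", "minute": 34})
candidates = [
    {"question": "Who scored?", "field": "scorer", "answer": "player 9"},
    {"question": "In which minute?", "field": "minute", "answer": "43"},
    {"question": "Who assisted?", "field": "assist", "answer": "Player 7"},
]
kept = [qa for qa in candidates if verify(qa, event)]
print(len(kept), "of", len(candidates), "pairs kept")
```

Discarding anything that cannot be checked is deliberate: an unverifiable pair is treated the same as a wrong one.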
Here is the loop, as a diagram, with the parts an operator needs to staff and budget for labeled clearly.
```mermaid
flowchart LR
    A[Proprietary data:<br/>video, docs, logs] --> B[Structured ground truth:<br/>your existing systems]
    A --> C[Strong general VLM<br/>used as annotator]
    B --> C
    C --> D[Synthesized Q/A pairs]
    D --> E[Automated verifier<br/>checks vs ground truth]
    E -->|kept| F[Fine-tune base VLM<br/>into domain expert]
    E -->|rejected| G[Discard or send<br/>to human review]
    F --> H[Evaluation harness:<br/>real questions from ops]
    H -->|pass| I[Deploy as agent]
    H -->|fail| C
```

Read left to right, the cost profile is: data you already have (free), a few thousand dollars of VLM inference to generate Q/A pairs, an engineer-week to write the verifier, a fine-tune run (low four figures on a rented GPU for a model of this size), and the evaluation harness, which is the part most teams underinvest in.
If you only remember one thing from MSUE for your own org: the evaluation harness is more valuable than the model. The SoccerNet challenge ships with a fixed evaluation set, which is why the field can compare submissions. Inside a business, you have to build that yourself, and it is the asset that makes the difference between an agent you can trust in production and a demo that breaks the first time a real user asks a real question.
A useful evaluation harness for a domain VQA agent has three layers:

To make this concrete, here is a sketch of the configuration an ops team would run to reproduce the MSUE pattern in a non-sports domain. This is not the SoccerNet code; it is the operator-facing wrapper you would build around any base VLM.
```yaml
# domain-vqa.yaml
# One config that drives synthesis, fine-tune, and eval.
# Owned by the ops lead, not the ML team.
domain: auto_claims
base_model: a-general-purpose-vlm
ground_truth_source:
  system: claims_db
  table: settled_claims
  fields: [claim_id, damage_type, severity, payout, repair_or_total]
synthesis:
  clips_per_claim: 3
  questions_per_clip: 3
  styles: [short_factual, categorical, long_form_why]
  verifier: structured_match
  keep_threshold: 0.85
fine_tune:
```
The file is boring on purpose. The point is that the operator owns the knobs that matter (which data, which questions, which pass criteria) and the ML team owns the steps in between. That separation is what makes the program survivable when the base model changes next quarter.
To kick off a run from a laptop:
```bash
# Synthesize, fine-tune, and evaluate in one command.
# Safe to re-run: each step is cached by config hash.
agenthive run domain-vqa.yaml \
  --stage synthesize \
  --stage finetune \
  --stage evaluate \
  --report out/claims-vqa-report.html
```

The HTML report is what the COO actually reads. It shows accuracy on the golden set, failure modes on the canary set, and a per-question diff against the last run. If the numbers regress, you do not deploy.
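The "do not deploy on regression" rule can be made mechanical. The sketch below is an assumption about how such a gate might work, not a description of what the `agenthive` report does internally: it takes per-question pass/fail results for two runs, lists the questions that flipped from pass to fail, and blocks deployment if accuracy dropped or any previously passing question now fails.

```python
def accuracy(results: dict[str, bool]) -> float:
    """Share of golden-set questions answered correctly."""
    if not results:
        raise ValueError("no evaluation results")
    return sum(results.values()) / len(results)


def regressions(previous: dict[str, bool],
                current: dict[str, bool]) -> list[str]:
    """Question IDs that passed in the last run and fail (or are missing) now."""
    return sorted(
        qid for qid, passed in previous.items()
        if passed and not current.get(qid, False)
    )


def deploy_allowed(previous: dict[str, bool],
                   current: dict[str, bool]) -> bool:
    # ASSUMPTION: a conservative gate. Any flipped question blocks the
    # release, even when overall accuracy is unchanged.
    if accuracy(current) < accuracy(previous):
        return False
    return not regressions(previous, current)


previous = {"q1": True, "q2": True, "q3": False}
current = {"q1": True, "q2": False, "q3": True}

print("regressed:", regressions(previous, current))
print("deploy:", deploy_allowed(previous, current))
```

In this example overall accuracy is flat, but the per-question diff still blocks the release because `q2` flipped. A headline number alone would hide that kind of regression.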
Three takeaways for operators looking at vertical agents this year.
First, narrow beats general for production work. The MSUE result, like a growing body of recent vertical-agent work, suggests that a smaller fine-tuned model with a clean evaluation loop can outperform a giant general model on the specific questions your business needs answered. The general model is still useful as the annotator and the grader. It is not necessarily the thing you deploy.
Second, the bottleneck moved. Two years ago it was model capability. One year ago it was inference cost. Today, for any team with proprietary data, the bottleneck is the data pipeline and the evaluation harness. Budget accordingly. If your AI line item is 90 percent model API spend and 0 percent evaluation engineering, you are pointed at last year's problem.
Third, your defensible asset is the loop, not the model. The base VLM you fine-tune today will be replaced by a better one within a year. The structured ground truth, the synthesis pipeline, the golden set, and the canary set carry forward. They are what let you re-fine-tune on the next base model in a week instead of a quarter. Build them like infrastructure, because that is what they are.