Agent Hive mark

Frequently asked questions

Do I need a machine learning team to do this?

You need one engineer who has done a fine-tune before, or who is willing to follow a tutorial carefully. The harder roles are the domain experts who define correctness and the operator who decides what to ship. Those people you already have.

Why not just keep using a frontier API and prompt better?

For generic tasks, prompting is fine. For tasks where the value comes from knowing your domain (your contracts, your customers, your jargon), prompts plateau. Fine-tuning encodes the knowledge into the weights instead of paying to re-explain it in every request. Costs invert: high up-front, low per call.

How big does the training set need to be?

For a LoRA adapter on a task like translation or classification, a few thousand high-quality examples is often enough to beat a much larger generic model. Quality matters more than quantity once you are past about a thousand examples. The CCL25 team's contribution was the dataset itself, which tells you where the work goes.

What the paper actually shows

The task has three parts: translate classical Chinese poems into modern Chinese, identify the emotional register, and answer comprehension questions about meaning. Frontier models do this passably. They miss on the parts that require knowing the canon: allusions, fixed metaphors, period-specific vocabulary.

The team did two things. First, they built a new training dataset specifically for the task, combining existing poetry corpora with question-answer pairs they generated and curated. Second, they fine-tuned Qwen2.5, an open-weights model from Alibaba, using LoRA (Low-Rank Adaptation, a method that trains a small set of extra parameters instead of the whole model). The result was a competitive submission on the official leaderboard.

The interesting part for operators is not the score. It is the shape of the work: a focused dataset, a small training run, an open base model, and an evaluation harness the team controlled end to end.

Why LoRA matters to a non-technical buyer

Full fine-tuning of a 7B parameter model means updating roughly 7 billion numbers. That requires multiple high-end GPUs, careful memory management, and a serious bill. LoRA freezes the original weights and trains a small adapter on top, often less than 1 percent of the parameter count. The practical consequences:

One GPU is usually enough. A single rented A100 for a few hours, not a cluster for a week.
The adapter is small (tens to hundreds of megabytes). You can keep dozens of them for different tasks and swap them in at serving time.
You can roll back. If a new adapter underperforms, the base model is untouched.
The base model stays generic, so you can keep using it for other tasks in parallel.

Here is what the cost shape looks like in practice for a domain fine-tune of this size.

Approach	Hardware	Wall-clock time	Approx. cost	Who owns the result
API prompt engineering only	None	Days of iteration	Per-token fees forever	Vendor
Full fine-tune of 7B model	4-8 A100s	1-3 days	$2k-$10k per run	You, but heavy to retrain
LoRA fine-tune of 7B model	1 A100	4-12 hours	$50-$300 per run	You, easy to retrain
LoRA fine-tune of 14B model	1 H100	6-18 hours	$150-$600 per run	You, easy to retrain

The numbers are illustrative, not from the paper. The point is the shape: LoRA puts domain fine-tuning inside the discretionary budget of a single team lead.

The dataset is the moat, not the model

The team's headline contribution is the dataset, not the training recipe. This is the pattern that repeats across every serious domain fine-tune we see. The base model is a commodity. The proprietary data, the curation choices, and the evaluation set are the things competitors cannot copy quickly.

If you are an operator looking at a domain task (contract review, claims triage, support deflection, internal knowledge retrieval), the work that matters is not picking the model. It is:

Defining what "correct" means in your domain, as concrete examples.
Collecting or generating enough of those examples to train on, usually a few thousand at minimum.
Holding out a clean evaluation set you trust, before you train anything.
Running the fine-tune, which is now the cheapest step.
Re-running evaluation and deciding whether the adapter ships.

The paper does steps 1 through 3 carefully and steps 4 through 5 mechanically. That ordering is the lesson.

A minimal data schema

Most domain fine-tunes come down to instruction, input, output triples. For the poetry task the structure looks roughly like this.

{
 "instruction": "Translate the following classical Chinese poem into modern Chinese and identify its dominant emotion.",
 "input": "床前明月光,疑是地上霜。举头望明月,低头思故乡。",
 "output": {
 "translation": "明亮的月光洒在床前,好像地上结了一层霜。抬头望着天上的明月,低头思念起故乡。",
 "emotion": "homesickness",
 "rationale": "The final couplet explicitly contrasts looking up at the moon with thinking of home."
 }
}

That schema, repeated a few thousand times with real examples and real domain expertise, is the entire training set. For your business, swap the poem for a contract clause, a support ticket, or a lab note. The shape is the same.

What the training loop looks like

This is the part most operators never see, so it stays mythological. It should not. A LoRA fine-tune of Qwen2.5 using the Hugging Face stack is a small script. The annotated version below is the actual work, not pseudo-code.

# Fine-tune Qwen2.5 with LoRA on a domain dataset.
# Runs on one GPU. Produces an adapter file you can ship.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
 
base = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
 
# LoRA config: train ~0.5% of parameters, leave the rest frozen.
lora = LoraConfig(
 r=16, lora_alpha=32, lora_dropout=0.05,
 target_modules=["q_proj",

That is the whole training job. The hard parts, the parts that decide whether the result is good, sit outside this file: the JSONL you feed in, and the evaluation script you run after.

The evaluation harness is non-negotiable

The CCL25 task ships with an evaluation set and a scoring rubric. That is why the team could iterate. Most internal AI projects skip this and then argue about whether the output "feels better" after each change. That is how budgets disappear.

For an operator, the discipline is: write the evaluation set before you train. A few hundred labeled examples your team agrees represent the work, scored by a stable rubric, with a number that goes up or down. Without it, you cannot tell a real improvement from a regression, and you cannot defend a deployment decision.

flowchart LR
 A[Domain examples] --> B[Eval set, frozen]
 A --> C[Training set]
 C --> D[LoRA fine-tune]
 D --> E[Candidate adapter]
 B --> F[Score adapter]
 E --> F
 F --> G{Better than baseline?}
 G -->|Yes| H[Ship adapter]
 G -->|No| I[Revise data, retry]
 I --> C

The loop is boring. That is the point. Boring loops are the ones that produce reliable systems and predictable budgets.

A serving sketch

Once you have an adapter, you serve it. The base model stays loaded, and the adapter is attached at request time. If you have multiple domains, you can host multiple adapters against the same base.

# Serve Qwen2.5 base with the LoRA adapter attached.
# vLLM handles batching; the adapter is loaded once at startup.
python -m vllm.entrypoints.openai.api_server \
 --model Qwen/Qwen2.5-7B-Instruct \
 --enable-lora \
 --lora-modules poetry=./qwen25-poetry-lora/final \
 --port 8000

Calls to this server look like normal OpenAI-compatible API calls, with a model name of poetry to route to the adapter. Your application code does not need to know the model is fine-tuned. Your finance team notices because the per-call cost is the GPU rental, not a per-token fee.

What this means for agentic operations

The paper is a single-task fine-tune, but it slots directly into the broader pattern of building an agent operating model: a set of specialized agents, each backed by a model or adapter that is genuinely good at one thing, coordinated by a planner.

In that picture, the LoRA adapter is the unit of specialization. You do not need a different base model per domain. You need a base model, a library of adapters, and the discipline to keep an eval set per adapter. Adding a new capability becomes a small project, not a procurement cycle.

For governance, this matters. When auditors or regulators ask why your system made a decision, "we fine-tuned an open model on this curated dataset, here is the eval score on this rubric, here is the adapter version that was live at the time" is a defensible answer. "We prompted a vendor's model" is harder to defend, because you do not control the weights or the version history.

What to copy if you are an operator

The poetry task is far from most businesses. The method is not.

Pick one task where your team has clear domain expertise and where a generic model is almost good enough but not quite.
Write the rubric and the evaluation set first, by hand, with the people who will judge the output.
Collect or generate a training set of at least a few thousand examples in the same shape.
Run a LoRA fine-tune on an open base model in the 7B to 14B range. Budget a few hundred dollars for the first run.
Compare on the eval set. Ship if it wins. Iterate on the data if it does not.

The total elapsed time for a first pass is two to six weeks, most of it spent on the data and the rubric. The model training is the short part.

LoRA Fine-Tuning Qwen2.5 for Classical Chinese Poetry Tasks