
How a small team outperformed frontier models on CCL25-Eval Task 5 by fine-tuning Qwen2.5 with LoRA on a custom poetry dataset.
You need one engineer who has done a fine-tune before, or who is willing to follow a tutorial carefully. The harder roles are the domain experts who define correctness and the operator who decides what to ship. Those people you already have.
For generic tasks, prompting is fine. For tasks where the value comes from knowing your domain (your contracts, your customers, your jargon), prompts plateau. Fine-tuning encodes the knowledge into the weights instead of paying to re-explain it in every request. Costs invert: high up-front, low per call.
For a LoRA adapter on a task like translation or classification, a few thousand high-quality examples is often enough to beat a much larger generic model. Quality matters more than quantity once you are past about a thousand examples. The CCL25 team's contribution was the dataset itself, which tells you where the work goes.

14-day trial. No DevOps. No Sales call. Provisioned in under a minute.
You re-run the fine-tune against the new base. Because LoRA training is cheap, this is a scheduled chore rather than a project. Keep the dataset and the eval set in version control. The adapter is disposable; the data is the asset.
Open-weights fine-tuning is generally easier to defend in regulated settings than calling an external API, because you control the data path and the model version. You still need the standard controls: data lineage, eval set documentation, change logs on adapter versions, and a rollback plan. The fine-tune itself does not create new risk, but it does shift accountability onto your team, which is usually what regulators want.
The CCL25-Eval Task 5 system report is a narrow paper about translating and interpreting classical Chinese poetry. It is also a clean case study in how a small team can take a mid-sized open model, add a focused dataset, and outperform generic frontier systems on a domain task. For an operator deciding whether to fine-tune or to keep paying per token, the numbers and the method matter more than the poetry.
This post translates the system report into operator language: what the team did, what it cost in effort, and what to copy if you are running a team that needs a model to be very good at one specific thing.
The task has three parts: translate classical Chinese poems into modern Chinese, identify the emotional register, and answer comprehension questions about meaning. Frontier models do this passably. They miss on the parts that require knowing the canon: allusions, fixed metaphors, period-specific vocabulary.
The team did two things. First, they built a new training dataset specifically for the task, combining existing poetry corpora with question-answer pairs they generated and curated. Second, they fine-tuned Qwen2.5, an open-weights model from Alibaba, using LoRA (Low-Rank Adaptation, a method that trains a small set of extra parameters instead of the whole model). The result was a competitive submission on the official leaderboard.
The interesting part for operators is not the score. It is the shape of the work: a focused dataset, a small training run, an open base model, and an evaluation harness the team controlled end to end.

Full fine-tuning of a 7B parameter model means updating roughly 7 billion numbers. That requires multiple high-end GPUs, careful memory management, and a serious bill. LoRA freezes the original weights and trains a small adapter on top, often less than 1 percent of the parameter count. The practical consequences:
Here is what the cost shape looks like in practice for a domain fine-tune of this size.
| Approach | Hardware | Wall-clock time | Approx. cost | Who owns the result |
|---|---|---|---|---|
| API prompt engineering only | None | Days of iteration | Per-token fees forever | Vendor |
| Full fine-tune of 7B model | 4-8 A100s | 1-3 days | $2k-$10k per run | You, but heavy to retrain |
| LoRA fine-tune of 7B model | 1 A100 | 4-12 hours | $50-$300 per run | You, easy to retrain |
| LoRA fine-tune of 14B model | 1 H100 | 6-18 hours | $150-$600 per run | You, easy to retrain |
The numbers are illustrative, not from the paper. The point is the shape: LoRA puts domain fine-tuning inside the discretionary budget of a single team lead.
The team's headline contribution is the dataset, not the training recipe. This is the pattern that repeats across every serious domain fine-tune we see. The base model is a commodity. The proprietary data, the curation choices, and the evaluation set are the things competitors cannot copy quickly.
If you are an operator looking at a domain task (contract review, claims triage, support deflection, internal knowledge retrieval), the work that matters is not picking the model. It is:
The paper does steps 1 through 3 carefully and steps 4 through 5 mechanically. That ordering is the lesson.
Most domain fine-tunes come down to instruction, input, output triples. For the poetry task the structure looks roughly like this.
{
"instruction": "Translate the following classical Chinese poem into modern Chinese and identify its dominant emotion.",
"input": "床前明月光,疑是地上霜。举头望明月,低头思故乡。",
"output": {
"translation": "明亮的月光洒在床前,好像地上结了一层霜。抬头望着天上的明月,低头思念起故乡。",
"emotion": "homesickness",
"rationale": "The final couplet explicitly contrasts looking up at the moon with thinking of home."
}
}That schema, repeated a few thousand times with real examples and real domain expertise, is the entire training set. For your business, swap the poem for a contract clause, a support ticket, or a lab note. The shape is the same.
This is the part most operators never see, so it stays mythological. It should not. A LoRA fine-tune of Qwen2.5 using the Hugging Face stack is a small script. The annotated version below is the actual work, not pseudo-code.
# Fine-tune Qwen2.5 with LoRA on a domain dataset.
# Runs on one GPU. Produces an adapter file you can ship.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
base = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
# LoRA config: train ~0.5% of parameters, leave the rest frozen.
lora = LoraConfig(
r=16, lora_alpha=32, lora_dropout=0.05,
target_modules=["q_proj",
That is the whole training job. The hard parts, the parts that decide whether the result is good, sit outside this file: the JSONL you feed in, and the evaluation script you run after.
The CCL25 task ships with an evaluation set and a scoring rubric. That is why the team could iterate. Most internal AI projects skip this and then argue about whether the output "feels better" after each change. That is how budgets disappear.
For an operator, the discipline is: write the evaluation set before you train. A few hundred labeled examples your team agrees represent the work, scored by a stable rubric, with a number that goes up or down. Without it, you cannot tell a real improvement from a regression, and you cannot defend a deployment decision.
flowchart LR
A[Domain examples] --> B[Eval set, frozen]
A --> C[Training set]
C --> D[LoRA fine-tune]
D --> E[Candidate adapter]
B --> F[Score adapter]
E --> F
F --> G{Better than baseline?}
G -->|Yes| H[Ship adapter]
G -->|No| I[Revise data, retry]
I --> CThe loop is boring. That is the point. Boring loops are the ones that produce reliable systems and predictable budgets.
Once you have an adapter, you serve it. The base model stays loaded, and the adapter is attached at request time. If you have multiple domains, you can host multiple adapters against the same base.
# Serve Qwen2.5 base with the LoRA adapter attached.
# vLLM handles batching; the adapter is loaded once at startup.
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--enable-lora \
--lora-modules poetry=./qwen25-poetry-lora/final \
--port 8000Calls to this server look like normal OpenAI-compatible API calls, with a model name of poetry to route to the adapter. Your application code does not need to know the model is fine-tuned. Your finance team notices because the per-call cost is the GPU rental, not a per-token fee.

The paper is a single-task fine-tune, but it slots directly into the broader pattern of building an agent operating model: a set of specialized agents, each backed by a model or adapter that is genuinely good at one thing, coordinated by a planner.
In that picture, the LoRA adapter is the unit of specialization. You do not need a different base model per domain. You need a base model, a library of adapters, and the discipline to keep an eval set per adapter. Adding a new capability becomes a small project, not a procurement cycle.
For governance, this matters. When auditors or regulators ask why your system made a decision, "we fine-tuned an open model on this curated dataset, here is the eval score on this rubric, here is the adapter version that was live at the time" is a defensible answer. "We prompted a vendor's model" is harder to defend, because you do not control the weights or the version history.
The poetry task is far from most businesses. The method is not.
The total elapsed time for a first pass is two to six weeks, most of it spent on the data and the rubric. The model training is the short part.