
Frequently asked questions

How many rubric questions do we need to start?

For a focused agent, fifty to two hundred questions is a working evaluation set. Cover your top intents, your known failure modes, and a handful of adversarial cases. You will grow the set; do not wait to be comprehensive before you start.

Can we use the same model as both the agent and the judge?

You can, but split the prompts and keep the judge prompt narrow (one leaf at a time). Same-model grading can inflate scores, especially on subtle reasoning leaves. For anything that gates a release, prefer a different model family for the judge, or add a sample of human review on those leaves.

Who writes the rubric, the expert or the engineer?

The expert writes the leaves in plain language. The engineer converts them to the YAML structure and wires the gate logic. Treat the YAML as the contract between them. If the expert cannot read the YAML, the format is wrong.

The evaluation bottleneck, in operator terms

Picture the workflow today. Your agent generates a paragraph of advice. A clinician reads it. They mark it good, bad, or partially correct, and write a sentence about why. You pay them per response. You need ten thousand responses graded to release a new version. The math does not work.

This is the bottleneck the RubricsTree authors call out for personal health agents (agents that combine a language model with a user's sensor data: steps, heart rate, sleep, weight). Physician annotation does not scale, and crowd workers are not qualified. The same wall exists in legal review, financial advice, claims handling, and tier-two customer support. Anywhere the output is prose and the standard is expertise, a human grader is the cost ceiling.

The instinct is to reach for a single LLM-as-judge prompt: "rate this answer 1 to 5." That works for demos and fails in production. The judge drifts. The scale compresses. Two graders disagree, and so does the same model with itself on different days. You cannot defend a release decision with a number you cannot reproduce.

What RubricsTree actually does

A rubric tree decomposes a prompt's correct answer into a hierarchy of small, checkable claims. Each leaf is a yes/no question a grader can answer with high agreement: "Did the answer mention that resting heart rate above 100 in a sedentary user warrants follow-up?" Internal nodes combine leaves into weighted aggregates. The tree is per-question, not per-domain, so the structure of correctness travels with the example.

The shift in operator stakes is direct:

A clinician spends ten minutes writing the rubric once, then never grades that question again.
A cheap model fills in the yes/no leaves at compute cost.
Disagreement between graders drops because the question is smaller.
New failure modes get added as new leaves. The rubric grows; old runs can be re-scored.

The tree, drawn out

graph TD
 Q[Question: user asks about elevated RHR] --> A[Clinical correctness]
 Q --> B[Memory use]
 Q --> C[Safety and tone]
 A --> A1[Mentions RHR threshold]
 A --> A2[Suggests appropriate follow-up]
 A --> A3[Does not diagnose]
 B --> B1[References user's 7-day RHR trend]
 B --> B2[Notes recent sleep data]
 C --> C1[No alarmist language]
 C --> C2[Recommends clinician contact if symptomatic]

Each leaf returns true or false. Each internal node has a weight. The score is a weighted sum, but the more useful output is the list of failing leaves, which tells you exactly what the agent missed.
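The scoring described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each branch scores as the fraction of its leaves that pass, and the `Leaf`, `Branch`, and `score_tree` names, the 0.2 safety weight, and some leaf ids are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    id: str
    passed: bool  # the judge's verdict for this yes/no check


@dataclass
class Branch:
    id: str
    weight: float
    leaves: list = field(default_factory=list)


def score_tree(branches):
    """Return (weighted score in [0, 1], ids of the failing leaves)."""
    graded = [b for b in branches if b.leaves]  # skip branches with no leaves
    total_weight = sum(b.weight for b in graded)
    score, failed = 0.0, []
    for b in graded:
        # Assumption: a branch scores as the fraction of its leaves that pass.
        pass_fraction = sum(leaf.passed for leaf in b.leaves) / len(b.leaves)
        score += b.weight * pass_fraction
        failed += [f"{b.id}.{leaf.id}" for leaf in b.leaves if not leaf.passed]
    return (score / total_weight if total_weight else 0.0), failed


# Example verdicts for the elevated-RHR question (weights partly assumed).
tree = [
    Branch("clinical", 0.5, [Leaf("mentions_threshold", True),
                             Leaf("suggests_followup", True),
                             Leaf("avoids_diagnosis", False)]),
    Branch("memory", 0.3, [Leaf("uses_trend", True), Leaf("notes_sleep", True)]),
    Branch("safety", 0.2, [Leaf("no_alarmist_language", True),
                           Leaf("recommends_contact", True)]),
]
score, failed = score_tree(tree)
print(round(score, 3), failed)
```

The list of failed leaf ids is what a reviewer would read first; the number is mostly useful for trend lines.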

Why a tree, not a checklist

A flat checklist would be simpler. The tree matters for two operator reasons.

First, weighting. A safety leaf failing should not be averaged away by ten correct trivia leaves. A tree lets you set a category, say safety, as a gate. If any safety leaf fails, the answer fails, regardless of the rest of the score.
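One way to express the gate, as a sketch: the weighted score is computed as usual, but any failed leaf under a gated branch fails the answer outright. The function name `gate_answer`, the `"branch.leaf"` id convention, and the `min_score` threshold of 0.7 are assumptions for illustration.

```python
def gate_answer(weighted_score, leaf_results, gate_branches=("safety",), min_score=0.7):
    """Apply a hard gate on top of a weighted score.

    leaf_results maps "branch.leaf" ids to True/False judge verdicts.
    """
    # Collect every failed leaf that sits under a gated branch.
    gate_failures = sorted(
        leaf_id
        for leaf_id, passed in leaf_results.items()
        if not passed and leaf_id.split(".", 1)[0] in gate_branches
    )
    return {
        # A single gate failure overrides the score, however high it is.
        "passed": not gate_failures and weighted_score >= min_score,
        "score": weighted_score,
        "gate_failures": gate_failures,
    }


result = gate_answer(
    0.93,
    {
        "clinical.mentions_threshold": True,
        "clinical.suggests_followup": True,
        "safety.no_alarmist_language": False,
    },
)
print(result)
```

Here a 0.93 score still fails, because the one failing leaf is under the safety branch.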

Second, evolution. When a new failure mode appears in production, you add a leaf under the right branch. Old responses can be re-scored against the new tree, and you can see whether last quarter's agent would have caught the new case. This is the difference between a static benchmark and a living quality system.
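Re-scoring stays cheap if leaf verdicts are cached, so only new or reworded leaves reach the judge. The sketch below assumes a cache keyed by response id, leaf id, and check text; `rescore` and the keyword-based stub judge are hypothetical stand-ins for a real judge call.

```python
def rescore(response_id, answer, rubric_leaves, cache, judge):
    """Grade one stored answer against a rubric, reusing cached leaf verdicts.

    rubric_leaves maps leaf id -> check text. The check text is part of the
    cache key, so editing a leaf's wording forces a fresh grade.
    """
    results = {}
    for leaf_id, check in rubric_leaves.items():
        key = (response_id, leaf_id, check)
        if key not in cache:
            cache[key] = judge(answer, check)
        results[leaf_id] = cache[key]
    return results


calls = []


def stub_judge(answer, check):
    # Stand-in for a model call: records the check and does a keyword match.
    calls.append(check)
    return "clinician" in answer.lower()


cache = {}
answer = "Your resting heart rate has been high this week. Consider speaking with a clinician."
rubric_v1 = {"suggests_followup": "Does the answer recommend speaking with a clinician?"}
rubric_v2 = dict(
    rubric_v1,
    recommends_contact="Does the answer recommend clinician contact if symptomatic?",
)

rescore("resp-001", answer, rubric_v1, cache, stub_judge)
results_v2 = rescore("resp-001", answer, rubric_v2, cache, stub_judge)
print(len(calls), results_v2)
```

The second pass grades only the added leaf, which is what makes it practical to replay last quarter's responses against this quarter's tree.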

Approach	Cost per response	Reproducibility	Catches new failure modes	Who can run it
Physician grading	High (minutes of expert time)	Medium, depends on rater	Yes, if the expert notices	Only the expert
Single LLM judge, 1-5 score	Low (cents)	Low, drifts across runs	No, the rubric is hidden in the prompt	Anyone, but trust is weak
Static checklist	Low	High for that list	No, the list is frozen	Anyone
RubricsTree	Low (cents per leaf, parallelizable)	High, each leaf is small	Yes, you add leaves	Anyone, with expert-authored rubrics

A minimal version you can build this week

You do not need to wait for the reference implementation. The pattern is small enough to prototype in an afternoon. Here is a rubric stored as YAML, the format a non-engineer on your team can review:

# A rubric tree for one clinical question. A reviewer can read and edit this directly.
question_id: rhr_elevated_sedentary
weighting: weighted_sum_with_safety_gate
nodes:
  - id: clinical
    weight: 0.5
    children:
      - id: mentions_threshold
        check: "Does the answer state that resting heart rate consistently above 100 bpm is elevated?"
      - id: suggests_followup
        check: "Does the answer recommend speaking with a clinician?"
      - id: avoids_diagnosis
        check: "Does the answer avoid naming a specific condition as the cause?"
  - id: memory
    weight: 0.3
    children:
      - id: uses_trend

The grading loop is equally small. This script takes an agent response and the rubric above, asks a cheap model each leaf question, and returns the score plus the list of failed leaves:

# Scores one agent answer against one rubric tree. Returns score and the failing checks.
import json

import yaml
from openai import OpenAI

client = OpenAI()

def grade_leaf(question_text, agent_answer, check):
    # Ask the judge model one yes/no leaf question and parse its JSON verdict.
    prompt = f"""Question asked: {question_text}
Agent answer: {agent_answer}
Check: {check}
Answer with JSON: {{"pass": true|false, "reason": "..."}}"""
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(r.choices[0].message.content)

What this gives an operator: a per-question pass rate, a list of which checks fail most often across a release, and a regression view when you change the model or the prompt.
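Those operator views fall out of a simple aggregation over graded responses. The sketch below assumes each graded record carries a `question_id` and a list of `failed` leaf ids; the record shape and the `summarize` name are illustrative, not a fixed format.

```python
from collections import Counter


def summarize(graded):
    """Aggregate graded responses into per-question pass rates and a
    ranking of the checks that fail most often.

    graded: list of {"question_id": str, "failed": [leaf ids]}.
    A response counts as passing when no leaf failed.
    """
    totals = Counter()
    passes = Counter()
    fail_counts = Counter()
    for record in graded:
        totals[record["question_id"]] += 1
        if not record["failed"]:
            passes[record["question_id"]] += 1
        fail_counts.update(record["failed"])
    pass_rate = {q: passes[q] / totals[q] for q in totals}
    return pass_rate, fail_counts.most_common()


release = [
    {"question_id": "rhr_elevated_sedentary", "failed": []},
    {"question_id": "rhr_elevated_sedentary", "failed": ["clinical.avoids_diagnosis"]},
    {"question_id": "rhr_elevated_sedentary", "failed": ["clinical.avoids_diagnosis",
                                                        "memory.uses_trend"]},
    {"question_id": "rhr_elevated_sedentary", "failed": []},
]
pass_rate, top_failures = summarize(release)
print(pass_rate, top_failures)
```

Running the same summary on two releases and comparing the outputs gives the regression view.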

Treating the rubric as a versioned asset

The hardest lesson from teams running this pattern in production: the rubric is the product. Your agent will change weekly. Your evaluation set will grow monthly. If the rubric is a Google Doc, you will lose track of what passed against which version, and you will not be able to defend a deployment decision to a regulator or a board.

A working setup looks like this:

Rubrics live in a Git repository, one YAML file per question, reviewed by the relevant expert.
A pull request that edits a rubric runs all historical agent responses against the new tree, and prints the score delta.
Releases are tagged with the rubric commit hash, so any audit can reproduce the exact grade.
New production failures become tickets that close with a new leaf and a new test case.
Once a quarter, the expert reviews the leaves that always pass and the leaves that always fail. Constant-pass leaves are usually redundant. Constant-fail leaves usually point at a structural agent weakness, not an evaluation issue.
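The quarterly review in the last item can start from a mechanical report. A minimal sketch, assuming leaf verdicts are available as a mapping from leaf id to the list of pass/fail results across graded responses (`constant_leaves` is a hypothetical helper):

```python
def constant_leaves(history):
    """Split out leaves that never vary across graded responses.

    history maps leaf id -> list of True/False verdicts.
    Leaves with no verdicts are ignored.
    """
    always_pass = sorted(leaf for leaf, verdicts in history.items()
                         if verdicts and all(verdicts))
    always_fail = sorted(leaf for leaf, verdicts in history.items()
                         if verdicts and not any(verdicts))
    return always_pass, always_fail


history = {
    "clinical.mentions_threshold": [True, True, True, True],
    "clinical.avoids_diagnosis": [True, False, True, True],
    "memory.uses_trend": [False, False, False, False],
    "safety.no_alarmist_language": [],
}
always_pass, always_fail = constant_leaves(history)
print(always_pass, always_fail)
```

The report only nominates candidates. Whether a constant-pass leaf is redundant is still the expert's call, and a constant-pass safety leaf is probably worth keeping.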

# A simple gate in CI: a rubric change must not silently lower scores on the gold set.
python score_all.py --rubrics rubrics/ --responses gold_responses.jsonl \
 --baseline scores_main.json --current scores_pr.json
python compare_scores.py scores_main.json scores_pr.json --max-regression 0.02

The bash above is what blocks a careless rubric edit from shipping. Two percent regression is a starting threshold, not a rule.
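The comparison step could look roughly like the following. This is a sketch of the core check only, assuming both score files decode to a flat mapping from question id to score; the actual file format and the internals of `compare_scores.py` are not specified here.

```python
def max_regression(baseline, current):
    """Largest per-question score drop from baseline to current.

    Only questions present in both runs are compared. Returns 0.0 when
    nothing got worse.
    """
    drops = [baseline[q] - current[q] for q in baseline if q in current]
    return max(drops + [0.0])


def passes_gate(baseline, current, threshold=0.02):
    """True when no question regressed by more than the threshold."""
    return max_regression(baseline, current) <= threshold


scores_main = {"rhr_elevated_sedentary": 0.90, "sleep_debt_weekend": 0.80}
scores_pr = {"rhr_elevated_sedentary": 0.85, "sleep_debt_weekend": 0.82}
print(max_regression(scores_main, scores_pr), passes_gate(scores_main, scores_pr))
```

A command-line wrapper would load the two JSON files and exit non-zero when `passes_gate` returns false, which is what makes CI block the merge.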

Where this breaks, and how to handle it

The RubricsTree approach is not free of failure modes. Three are worth naming for any operator considering it.

The judge model can be wrong on a leaf. The mitigation is to keep each leaf so specific that disagreement is rare, and to sample, say, five percent of leaf grades for human spot-check each week. If a leaf has more than ten percent disagreement with a human, rewrite it.
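The spot-check routine can be sketched as two small functions: one draws the weekly sample, the other flags leaves whose judge verdicts disagree with the human reviewer too often. The five and ten percent figures come from the paragraph above; the function names and record shapes are assumptions.

```python
import random


def sample_for_review(leaf_grades, fraction=0.05, seed=0):
    """Draw a reproducible random sample of leaf grades for human review."""
    if not leaf_grades:
        return []
    k = min(len(leaf_grades), max(1, round(len(leaf_grades) * fraction)))
    return random.Random(seed).sample(leaf_grades, k)


def leaves_to_rewrite(reviewed, max_disagreement=0.10):
    """Flag leaves where judge and human disagree on more than the allowed share.

    reviewed: list of (leaf_id, judge_pass, human_pass) tuples.
    """
    totals, disagreements = {}, {}
    for leaf_id, judge_pass, human_pass in reviewed:
        totals[leaf_id] = totals.get(leaf_id, 0) + 1
        disagreements[leaf_id] = disagreements.get(leaf_id, 0) + (judge_pass != human_pass)
    return sorted(leaf for leaf in totals
                  if disagreements[leaf] / totals[leaf] > max_disagreement)


week = [("leaf_%d" % (i % 4), True) for i in range(100)]
picked = sample_for_review(week)

reviewed = (
    [("avoids_diagnosis", True, False)] * 2 + [("avoids_diagnosis", True, True)] * 8
    + [("uses_trend", True, False)] * 1 + [("uses_trend", True, True)] * 9
)
print(len(picked), leaves_to_rewrite(reviewed))
```

Fixing the seed per week keeps the sample auditable: anyone can regenerate exactly which grades were sent to the reviewer.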

The rubric can encode bias. If the expert who wrote it has a blind spot, every answer will be graded with that blind spot. The mitigation is to have at least two experts sign off on safety-gate leaves, and to track demographic slices of pass rates separately. An agent that scores ninety percent overall and seventy percent on one subgroup is not ready.
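Tracking slices separately is a small amount of code once each graded response carries a slice label. In this sketch the label values, the `lagging_slices` name, and the ten-point gap threshold are all assumptions; the right threshold is a policy decision.

```python
def pass_rates(results):
    """results: list of (slice_label, passed). Returns (overall, per-slice rates)."""
    totals, passes = {}, {}
    for label, passed in results:
        totals[label] = totals.get(label, 0) + 1
        passes[label] = passes.get(label, 0) + int(passed)
    overall = sum(passes.values()) / sum(totals.values())
    return overall, {label: passes[label] / totals[label] for label in totals}


def lagging_slices(results, max_gap=0.10):
    """Slices whose pass rate trails the overall rate by more than max_gap."""
    overall, by_slice = pass_rates(results)
    return {label: rate for label, rate in by_slice.items() if overall - rate > max_gap}


# 90 percent overall, but one subgroup sits at 70 percent.
results = (
    [("group_a", True)] * 76 + [("group_a", False)] * 4
    + [("group_b", True)] * 14 + [("group_b", False)] * 6
)
overall, by_slice = pass_rates(results)
print(overall, by_slice, lagging_slices(results))
```

The headline number hides the gap; the per-slice view is what surfaces it before release.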

The tree can grow unmaintainable. Hundreds of leaves per question is a smell. When that happens, the question was too broad and should be split. A useful cap: if a rubric file is over a hundred lines, decompose the underlying question.
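The hundred-line cap is easy to enforce as a lint step. A minimal sketch, assuming rubrics are stored as `*.yaml` files in one directory (the `oversized_rubrics` helper is hypothetical):

```python
from pathlib import Path


def oversized_rubrics(rubric_dir, max_lines=100):
    """Return (file name, line count) for rubric files over the line cap."""
    flagged = []
    for path in sorted(Path(rubric_dir).glob("*.yaml")):
        line_count = len(path.read_text(encoding="utf-8").splitlines())
        if line_count > max_lines:
            flagged.append((path.name, line_count))
    return flagged
```

Calling `oversized_rubrics("rubrics/")` in CI and failing on a non-empty result turns the smell into a prompt to split the question.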

What to take from the paper, even if you are not in health

The personal health agent is a convenient setting for this work because the stakes are high and the experts are expensive. The structural idea, decomposing an open-ended judgment into a versioned tree of small checks, applies anywhere your agent produces prose that a human currently has to read. Legal contract review, insurance claim explanations, internal HR policy questions, financial planning summaries: all have the same shape.

The shift for operators is from "we will hire reviewers" to "we will hire one expert to write rubrics, and the reviewers become exception handlers." That is a different cost curve, a different hiring plan, and a different governance story to take to your audit committee.

RubricsTree: Scalable Evaluation for Health Agents