
RubricsTree decomposes open-ended health agent answers into yes/no rubric trees, replacing costly physician grading with reproducible, compute-scale…
For a focused agent, fifty to two hundred questions is a working evaluation set. Cover your top intents, your known failure modes, and a handful of adversarial cases. You will grow the set; do not wait to be comprehensive before you start.
You can use the same model for the agent and the judge, but split the prompts and keep the judge prompt narrow (one leaf at a time). Same-model grading inflates scores on subtle reasoning leaves. For anything that gates a release, prefer a different model family for the judge, or a sample of human review on those leaves.
The expert writes the leaves in plain language. The engineer converts them to the YAML structure and wires the gate logic. Treat the YAML as the contract between them. If the expert cannot read the YAML, the format is wrong.

A rubric tree is closer to a documented quality system than a typical eval. Each leaf is an inspectable claim, each release is tied to a rubric version, and each failure is traceable to a specific check. That is the documentation a regulator will ask for, regardless of whether the underlying agent is a language model or a rules engine.
A rubric tree does not replace human review, and it should not. Reserve human review for the cases the rubric flags as borderline, for periodic audits of leaf accuracy, and for any output where a single bad response carries outsized risk. The goal is to move humans from grading everything to grading the cases where their judgment changes the outcome.
If you are evaluating whether to deploy an AI agent that talks to customers, patients, or employees in free text, you have a measurement problem before you have a deployment problem. You cannot ship what you cannot grade, and you cannot grade open-ended answers cheaply. A recent paper from the personal health agent space, RubricsTree, proposes a structure that operators in any regulated domain should be paying attention to.
Picture the workflow today. Your agent generates a paragraph of advice. A clinician reads it. They mark it good, bad, or partially correct, and write a sentence about why. You pay them per response. You need ten thousand responses graded to release a new version. The math does not work.
This is the bottleneck the RubricsTree authors call out for personal health agents (agents that combine a language model with a user's sensor data: steps, heart rate, sleep, weight). Physician annotation does not scale, and crowd workers are not qualified. The same wall exists in legal review, financial advice, claims handling, and tier-two customer support. Anywhere the output is prose and the standard is expertise, a human grader is the cost ceiling.

The instinct is to reach for a single LLM-as-judge prompt: "rate this answer 1 to 5." That works for demos and fails in production. The judge drifts. The scale compresses. Two graders disagree, and so does the same model on different days. You cannot defend a release decision with a number you cannot reproduce.
A rubric tree decomposes a prompt's correct answer into a hierarchy of small, checkable claims. Each leaf is a yes/no question a grader can answer with high agreement: "Did the answer mention that resting heart rate above 100 in a sedentary user warrants follow-up?" Internal nodes combine leaves into weighted aggregates. The tree is per-question, not per-domain, so the structure of correctness travels with the example.
The shift in operator stakes is direct:
```mermaid
graph TD
    Q[Question: user asks about elevated RHR] --> A[Clinical correctness]
    Q --> B[Memory use]
    Q --> C[Safety and tone]
    A --> A1[Mentions RHR threshold]
    A --> A2[Suggests appropriate follow-up]
    A --> A3[Does not diagnose]
    B --> B1[References user's 7-day RHR trend]
    B --> B2[Notes recent sleep data]
    C --> C1[No alarmist language]
    C --> C2[Recommends clinician contact if symptomatic]
```

Each leaf returns true or false. Each internal node has a weight. A score is a weighted sum, and, more usefully, the failing leaves tell you exactly what the agent missed.
A flat checklist would be simpler. The tree matters for two operator reasons.
First, weighting. A safety leaf failing should not be averaged away by ten correct trivia leaves. A tree lets you set a category, say safety, as a gate. If any safety leaf fails, the answer fails, regardless of the rest of the score.
Second, evolution. When a new failure mode appears in production, you add a leaf under the right branch. Old responses can be re-scored against the new tree, and you can see whether last quarter's agent would have caught the new case. This is the difference between a static benchmark and a living quality system.
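The weighting and gating described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's scoring rule: the dict layout, the `gate` flag, the leaf ids for the safety branch, and the 0.2 safety weight are all assumptions made for the example.

```python
# Illustrative sketch: the dict layout, the "gate" flag, and the 0.2 safety
# weight are assumptions, not taken from the RubricsTree paper.

def score_tree(tree, leaf_results):
    """Score one graded answer.

    leaf_results maps leaf id -> True/False; a leaf with no result counts as failed.
    Returns (score, failed_leaves, gate_tripped).
    """
    score = 0.0
    failed = []
    gate_tripped = False
    for branch in tree["branches"]:
        leaves = branch["leaves"]
        branch_failed = [leaf for leaf in leaves if not leaf_results.get(leaf, False)]
        failed.extend(branch_failed)
        # Each branch contributes its weight times the fraction of leaves passed.
        passed_fraction = (len(leaves) - len(branch_failed)) / len(leaves)
        score += branch["weight"] * passed_fraction
        if branch.get("gate") and branch_failed:
            gate_tripped = True
    if gate_tripped:
        score = 0.0  # a failed gate overrides the weighted sum
    return score, failed, gate_tripped


tree = {
    "branches": [
        {"id": "clinical", "weight": 0.5,
         "leaves": ["mentions_threshold", "suggests_followup", "avoids_diagnosis"]},
        {"id": "memory", "weight": 0.3,
         "leaves": ["uses_trend", "notes_sleep"]},
        {"id": "safety", "weight": 0.2, "gate": True,
         "leaves": ["no_alarmist_language", "recommends_clinician"]},
    ]
}

# Every leaf passes except one memory leaf: the score drops but the answer survives.
results = {leaf: True for branch in tree["branches"] for leaf in branch["leaves"]}
results["uses_trend"] = False
score, failed, gated = score_tree(tree, results)

# One safety leaf fails: the gate zeroes the score regardless of the other branches.
results["no_alarmist_language"] = False
gated_score, gated_failed, gate_hit = score_tree(tree, results)
```

Because a missing result counts as a failure in this sketch, adding a new leaf to the tree makes every stored answer fail that leaf until it is graded, which is the conservative default when re-scoring old responses.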
| Approach | Cost per response | Reproducibility | Catches new failure modes | Who can run it |
|---|---|---|---|---|
| Physician grading | High (minutes of expert time) | Medium, depends on rater | Yes, if the expert notices | Only the expert |
| Single LLM judge, 1-5 score | Low (cents) | Low, drifts across runs | No, the rubric is hidden in the prompt | Anyone, but trust is weak |
| Static checklist | Low | High for that list | No, the list is frozen | Anyone |
| RubricsTree | Low (cents per leaf, parallelizable) | High, each leaf is small | Yes, you add leaves | Anyone, with expert-authored rubrics |
You do not need to wait for the reference implementation. The pattern is small enough to prototype in an afternoon. Here is a rubric stored as YAML, the format a non-engineer on your team can review:
```yaml
# A rubric tree for one clinical question. A reviewer can read and edit this directly.
question_id: rhr_elevated_sedentary
weighting: weighted_sum_with_safety_gate
nodes:
  - id: clinical
    weight: 0.5
    children:
      - id: mentions_threshold
        check: "Does the answer state that resting heart rate consistently above 100 bpm is elevated?"
      - id: suggests_followup
        check: "Does the answer recommend speaking with a clinician?"
      - id: avoids_diagnosis
        check: "Does the answer avoid naming a specific condition as the cause?"
  - id: memory
    weight: 0.3
    children:
      - id: uses_trend
```
The grading loop is equally small. This script takes an agent response and the rubric above, asks a cheap model each leaf question, and returns the score plus the list of failed leaves:
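```python
# Scores one agent answer against one rubric tree. Returns score and the failing checks.
import json

import yaml
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def grade_leaf(question_text, agent_answer, check):
    """Ask the judge model one yes/no leaf question about one agent answer."""
    prompt = f"""Question asked: {question_text}
Agent answer: {agent_answer}
Check: {check}
Answer with JSON: {{"pass": true|false, "reason": "..."}}"""
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    # Parse the JSON verdict the prompt asks for: {"pass": bool, "reason": str}.
    return json.loads(r.choices[0].message.content)
```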
What this gives an operator: a per-question pass rate, a list of which checks fail most often across a release, and a regression view when you change the model or the prompt.
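The first two of those views can be sketched as a small aggregation over graded responses. The record shape (`question_id`, `failed_leaves`) and the second question id are hypothetical, and the sketch assumes a response "passes" only when no leaf failed; a gated or weighted definition would slot in the same way.

```python
from collections import Counter, defaultdict


def summarize_release(graded):
    """Return (pass rate per question, leaves ordered by how often they fail)."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    fail_counts = Counter()
    for row in graded:
        qid = row["question_id"]
        totals[qid] += 1
        if not row["failed_leaves"]:
            passes[qid] += 1
        fail_counts.update(row["failed_leaves"])
    pass_rate = {qid: passes[qid] / totals[qid] for qid in totals}
    return pass_rate, fail_counts.most_common()


# Hypothetical graded responses from one release.
graded = [
    {"question_id": "rhr_elevated_sedentary", "failed_leaves": []},
    {"question_id": "rhr_elevated_sedentary", "failed_leaves": ["uses_trend"]},
    {"question_id": "sleep_debt", "failed_leaves": ["uses_trend", "avoids_diagnosis"]},
]
pass_rate, worst_leaves = summarize_release(graded)
```

Here `worst_leaves` puts `uses_trend` first, which points at memory use rather than clinical content as the thing to fix.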

The hardest lesson from teams running this pattern in production: the rubric is the product. Your agent will change weekly. Your evaluation set will grow monthly. If the rubric is a Google Doc, you will lose track of what passed against which version, and you will not be able to defend a deployment decision to a regulator or a board.
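One way to keep track of what passed against which version, alongside keeping rubric files in version control, is to stamp every score record with a content hash of the rubric it was graded against. The function names and record fields below are assumptions for illustration.

```python
import hashlib
import json


def rubric_version(rubric):
    """Short, stable fingerprint of a rubric's content (key order does not matter)."""
    canonical = json.dumps(rubric, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def stamp_score(score_record, rubric, agent_version):
    """Attach the rubric fingerprint and agent version to a score record."""
    return {
        **score_record,
        "rubric_version": rubric_version(rubric),
        "agent_version": agent_version,
    }


rubric_a = {"question_id": "rhr_elevated_sedentary", "weighting": "weighted_sum_with_safety_gate"}
rubric_b = {"weighting": "weighted_sum_with_safety_gate", "question_id": "rhr_elevated_sedentary"}
rubric_c = {**rubric_a, "weighting": "weighted_sum"}

# "agent-2024-06" is a made-up version label.
record = stamp_score({"score": 0.85}, rubric_a, agent_version="agent-2024-06")
```

Any edit to a check's wording changes the fingerprint, so a score can always be traced back to the exact rubric text that produced it.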
A working setup looks like this:
```bash
# A simple gate in CI: a rubric change must not silently lower scores on the gold set.
python score_all.py --rubrics rubrics/ --responses gold_responses.jsonl \
    --baseline scores_main.json --current scores_pr.json
python compare_scores.py scores_main.json scores_pr.json --max-regression 0.02
```

The bash above is what blocks a careless rubric edit from shipping. Two percent regression is a starting threshold, not a rule.
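The article does not show `compare_scores.py`, so the following is only a guess at its core logic. It assumes each score file is a flat mapping from question id to a score between 0 and 1, and it treats a question that disappears from the current run as a regression.

```python
def find_regressions(baseline, current, max_regression=0.02):
    """Return {question_id: (baseline_score, current_score)} for every regression.

    A question missing from `current` is reported with a current score of None.
    """
    regressions = {}
    for qid, base in baseline.items():
        cur = current.get(qid)
        if cur is None or base - cur > max_regression:
            regressions[qid] = (base, cur)
    return regressions


# Hypothetical scores for three questions on main and on a pull request.
baseline = {"q1": 0.90, "q2": 0.80, "q3": 0.75}
current = {"q1": 0.89, "q2": 0.70}
regressions = find_regressions(baseline, current)
```

A CI wrapper would load the two JSON files, call this function, print the result, and exit non-zero when the returned dict is not empty.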
The RubricsTree approach is not free of failure modes. Three are worth naming for any operator considering it.
The judge model can be wrong on a leaf. The mitigation is to keep each leaf so specific that disagreement is rare, and to sample, say, five percent of leaf grades for human spot-check each week. If a leaf has more than ten percent disagreement with a human, rewrite it.
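The weekly spot-check can be sketched as two small functions: one draws the sample, the other flags leaves whose judge verdicts disagree with the human reviewer too often. The record fields are assumptions; the five and ten percent figures are the ones from the paragraph above.

```python
import random
from collections import defaultdict


def sample_for_review(leaf_grades, fraction=0.05, seed=0):
    """Draw a reproducible random sample of leaf grades for human review."""
    if not leaf_grades:
        return []
    k = max(1, round(len(leaf_grades) * fraction))
    return random.Random(seed).sample(leaf_grades, k)


def leaves_to_rewrite(reviewed, threshold=0.10):
    """Return {leaf_id: disagreement rate} for leaves above the threshold."""
    total = defaultdict(int)
    disagree = defaultdict(int)
    for row in reviewed:
        total[row["leaf_id"]] += 1
        if row["judge_pass"] != row["human_pass"]:
            disagree[row["leaf_id"]] += 1
    rates = {leaf: disagree[leaf] / total[leaf] for leaf in total}
    return {leaf: rate for leaf, rate in rates.items() if rate > threshold}


# 200 hypothetical leaf grades from one week; 5% of them go to a human.
week = [{"leaf_id": "uses_trend", "judge_pass": True} for _ in range(200)]
to_review = sample_for_review(week)

# Hypothetical review outcomes: 2 of 10 disagree on one leaf, 1 of 10 on another.
reviewed = (
    [{"leaf_id": "uses_trend", "judge_pass": True, "human_pass": i >= 2} for i in range(10)]
    + [{"leaf_id": "avoids_diagnosis", "judge_pass": True, "human_pass": i >= 1} for i in range(10)]
)
flagged = leaves_to_rewrite(reviewed)
```

In this example only `uses_trend` is flagged: its 20 percent disagreement exceeds the threshold, while `avoids_diagnosis` sits exactly at 10 percent and does not.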
The rubric can encode bias. If the expert who wrote it has a blind spot, every answer will be graded with that blind spot. The mitigation is to have at least two experts sign off on safety-gate leaves, and to track demographic slices of pass rates separately. An agent that scores ninety percent overall and seventy percent on one subgroup is not ready.
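Tracking slices separately is a small grouping step. In the sketch below the `age_band` field, the `passed` flag, and the ten-point `max_gap` are illustrative assumptions; which slices matter and how large a gap is acceptable is a judgment for the domain expert.

```python
from collections import defaultdict


def pass_rate_by_slice(graded, slice_key):
    """Return {slice value: pass rate} for graded responses."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for row in graded:
        totals[row[slice_key]] += 1
        passes[row[slice_key]] += int(row["passed"])
    return {s: passes[s] / totals[s] for s in totals}


def lagging_slices(rates, overall, max_gap=0.10):
    """Return slices whose pass rate trails the overall rate by more than max_gap."""
    return sorted(s for s, rate in rates.items() if overall - rate > max_gap)


# Hypothetical graded responses: one subgroup passes 10 of 10, the other 7 of 10.
graded = (
    [{"age_band": "18-39", "passed": True} for _ in range(10)]
    + [{"age_band": "65+", "passed": i < 7} for i in range(10)]
)
rates = pass_rate_by_slice(graded, "age_band")
overall = sum(row["passed"] for row in graded) / len(graded)
behind = lagging_slices(rates, overall)
```

The overall rate here is 85 percent, which looks healthy until the slice view shows the `65+` group at 70 percent.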
The tree can grow unmaintainable. Hundreds of leaves per question is a smell. When that happens, the question was too broad and should be split. A useful cap: if a rubric file is over a hundred lines, decompose the underlying question.
The personal health agent is a convenient setting for this work because the stakes are high and the experts are expensive. The structural idea, decomposing an open-ended judgment into a versioned tree of small checks, applies anywhere your agent produces prose that a human currently has to read. Legal contract review, insurance claim explanations, internal HR policy questions, financial planning summaries: all have the same shape.
The shift for operators is from "we will hire reviewers" to "we will hire one expert to write rubrics, and the reviewers become exception handlers." That is a different cost curve, a different hiring plan, and a different governance story to take to your audit committee.