Frequently asked questions

Is DRFLOW something I can run today against my own data?

The benchmark itself is a research artifact, but you can adopt the pattern now: log workflows and score predictions against them, using the schema and scoring function described in this article. Within a few weeks of logging, an internal eval set tends to be more useful to a team than any public benchmark.

How is this different from robotic process automation?

Robotic process automation (RPA) executes a fixed script. Workflow prediction chooses which script to run, in what order, for which user, based on context. RPA is the hands; workflow prediction is the part that decides what the hands should do next.

Do I need a large model to do workflow prediction well?

Not necessarily. The DRFLOW results suggest that conditioning on user history matters more than raw model scale for this task. A smaller model with good per-user context can beat a larger model that sees only the current request.

What DRFLOW actually measures

Deep research (DR) systems are agents that browse, read, and reason over many sources before answering. The well-known examples produce long reports with citations. DRFLOW asks a different question: given a user's history and a new request, can the agent predict the correct sequence of actions, the workflow, that this specific user would take?

A workflow here is concrete. It is an ordered list of action-steps: query a system, filter results, hand off to a teammate, draft a reply, log an outcome. The benchmark scores the agent on whether it predicts the right steps in the right order for the right person.

The shift matters because most enterprise tasks are not "write me a five-page brief on lithium supply." They are "for this renewal, what do we do next, in the order our team actually does it." Report quality is a poor proxy for that.

Why personalization is the hard part

Two account managers handling the same renewal will run different playbooks. One pulls usage data first; the other opens the contract. Both are correct for their own pipeline. A benchmark that ignores personalization will reward a generic "best practice" sequence that no actual operator follows.

DRFLOW conditions on user history. The agent sees prior workflows from the same user, then has to predict the next one. That mirrors how a real assistant would learn the team: by watching, not by reading the SOP wiki.
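
One way to make "conditioning on user history" concrete is a frequency baseline: predict the step sequence this user has run most often, and fall back to the team-wide most common sequence for a user with no history. This is a minimal sketch, not DRFLOW's method. The function name predict_from_history, the (user_id, steps) log shape, and the user and step names are all assumptions for illustration.

```python
from collections import Counter


def predict_from_history(user_id, history):
    """Return the step sequence this user ran most often.

    `history` is a list of (user_id, steps) pairs, where `steps` is a
    list of step names. The name and the shape are hypothetical.
    Falls back to the team-wide most common sequence for unseen users,
    and to an empty list when there is no history at all.
    """
    own = Counter(tuple(steps) for uid, steps in history if uid == user_id)
    if own:
        return list(own.most_common(1)[0][0])
    team = Counter(tuple(steps) for _, steps in history)
    if team:
        return list(team.most_common(1)[0][0])
    return []


# Two account managers handling renewals in different orders (made-up data).
history = [
    ("am_1", ["pull_usage_data", "open_contract"]),
    ("am_1", ["pull_usage_data", "open_contract"]),
    ("am_2", ["open_contract", "pull_usage_data"]),
]

print(predict_from_history("am_2", history))
```

A baseline like this is useful mainly as a floor: any system sold as personalized should beat it on your own logs.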

How workflow prediction differs from report generation

Here is a side-by-side view of the two evaluation styles. If you are sitting through vendor demos, this is the table to bring.

Dimension	Report-style DR	Workflow prediction (DRFLOW)	What it means for operators
Output	Prose with citations	Ordered list of action-steps	One is a deliverable; the other is execution
Primary metric	Text similarity, citation coverage	Step accuracy, sequence match	Step accuracy maps to "did the work get done right"
Personalization signal	Usually none	User history conditions the prediction	The agent learns your team, not a generic playbook
Failure mode	Hallucinated facts	Wrong step, wrong order, wrong tool	Wrong step is auditable; hallucinated prose often is not
Buying criterion	Quality of writing	Match rate against logged workflows	Use your own logs as the test set

The practical implication: if you only score vendors on report quality, you will pick the most fluent writer, not the agent most likely to do the work the way your team does it.

The metrics, in plain language

DRFLOW uses step-level and sequence-level scoring. In operator terms:

Step accuracy: of the steps the agent predicted, how many match a step the user actually took.
Sequence accuracy: did the agent get the order right, not just the set of steps.
Personalization lift: how much better is the agent when it sees user history versus when it does not.

The last one is the interesting business metric. If a system shows no personalization lift, you are paying for a generic recommender wrapped in a chat box.
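
Personalization lift can be computed as a simple difference once you have scores from two runs over the same requests, one with user history and one without. The sketch below assumes lift is the difference in mean step accuracy; DRFLOW may define it differently. The name personalization_lift and the score values are made up for illustration and are not benchmark results.

```python
from statistics import mean


def personalization_lift(scores_with_history, scores_without_history):
    """Mean score with user history minus mean score without it.

    Assumes both lists hold per-request step accuracies in [0, 1]
    for the same set of requests.
    """
    if not scores_with_history or not scores_without_history:
        raise ValueError("both score lists must be non-empty")
    return mean(scores_with_history) - mean(scores_without_history)


# Illustrative values only.
with_history = [0.75, 1.0, 0.5]
without_history = [0.5, 0.5, 0.5]

lift = personalization_lift(with_history, without_history)
print(f"personalization lift: {lift:+.2f}")
```

A lift near zero on your own logs is the signal described above: the system is behaving like a generic recommender.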

A worked example

Consider a customer success team. The job to be done: "prepare for the quarterly business review (QBR) with account A." Each customer success manager (CSM) executes this in their own way. Here is what a workflow prediction looks like in practice.

# A predicted workflow for a CSM preparing a QBR.
# Each step is a concrete action the agent expects the user to take next.
user_id: csm_27
request: "Prep QBR for account A"
predicted_workflow:
  - step: pull_usage_metrics
    tool: product_analytics
    args: { account: A, window: 90d }
  - step: pull_support_tickets
    tool: zendesk
    args: { account: A, status: [open, resolved], window: 90d }
  - step: check_renewal_date
    tool: salesforce
    args

What makes this hard is that another CSM, csm_42, might draft the agenda first and pull metrics second. A good model conditioned on csm_42's history should flip the order. A bad model will predict the same sequence for everyone.

You can score a prediction against a logged workflow with a small script. The point of showing it is that the metric is something an operations team can read.

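# Compare a predicted workflow to the workflow the user actually ran.
# Returns step accuracy and a strict sequence match.
def score_workflow(predicted, actual):
    pred_steps = [s["step"] for s in predicted]
    true_steps = [s["step"] for s in actual]

    # Step accuracy: share of predicted steps the user actually took.
    matched = sum(1 for s in pred_steps if s in true_steps)
    step_accuracy = matched / max(len(pred_steps), 1)

    # Sequence match: longest common prefix, normalized.
    prefix = 0
    for p, t in zip(pred_steps, true_steps):
        if p != t:
            break
        prefix += 1
    # Assumed normalization: divide by the longer sequence, so the score
    # reaches 1.0 only when both workflows are identical.
    sequence_match = prefix / max(len(pred_steps), len(true_steps), 1)

    return {"step_accuracy": step_accuracy, "sequence_match": sequence_match}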

If you log your team's actual workflows, you already have an eval set. You do not need DRFLOW's data to run DRFLOW's idea.
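
A longest-common-prefix match drops to zero when the very first step differs, even if the rest of the order is right. A gentler, order-aware alternative is the longest common subsequence (LCS) between predicted and logged step names. The sketch below is an assumption about how a team might score order, not a metric taken from DRFLOW, and lcs_order_score is a hypothetical name.

```python
def lcs_order_score(pred_steps, true_steps):
    """Longest common subsequence length divided by the length of the
    logged workflow. Returns 0.0 when the logged workflow is empty."""
    rows, cols = len(pred_steps), len(true_steps)
    # table[i][j] holds the LCS length of pred_steps[:i] and true_steps[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if pred_steps[i - 1] == true_steps[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols] / cols if cols else 0.0


logged = ["pull_usage_metrics", "pull_support_tickets", "check_renewal_date"]
# First two steps swapped: a prefix match scores this 0.
predicted = ["pull_support_tickets", "pull_usage_metrics", "check_renewal_date"]

score = lcs_order_score(predicted, logged)
print(round(score, 2))
```

This score ignores extra predicted steps, so it is best read alongside step accuracy rather than in place of it.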

Where this fits in an agent operating model

An agent operating model is the way a company organizes humans and agents around shared work. Workflow prediction sits in a specific slot in that model: the layer that decides what to do next, before the layer that executes a single tool call.

flowchart LR
 A[User request] --> B[Workflow predictor]
 B --> C{Confidence high?}
 C -- Yes --> D[Auto-execute steps]
 C -- No --> E[Suggest steps, human approves]
 D --> F[Log actual workflow]
 E --> F
 F --> G[Update user history]
 G --> B

The loop is the important part. Every executed workflow, whether the agent ran it or a human did, becomes a training and evaluation signal for the next prediction. That is what eval-driven operations means in practice: the system gets better because you are measuring the right thing on real work, not on a static benchmark.
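
The confidence gate and the logging step in the loop above can be sketched in a few lines. This is an illustration under stated assumptions: the predictor is assumed to return a confidence between 0 and 1, the 0.8 threshold is an arbitrary placeholder a team would tune, and route_prediction and record_outcome are hypothetical names.

```python
def route_prediction(steps, confidence, threshold=0.8):
    """Decide whether predicted steps run automatically or go to a human.

    `confidence` is assumed to be a float in [0, 1] from the predictor;
    `threshold` is a hypothetical cut-off, not a recommended value.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    mode = "auto_execute" if confidence >= threshold else "suggest_for_approval"
    return {"mode": mode, "steps": list(steps)}


def record_outcome(history, user_id, request, actual_steps):
    """Append the workflow that actually ran, whoever ran it, to the
    history that the next prediction will be conditioned on."""
    history.append(
        {"user_id": user_id, "request": request, "steps": list(actual_steps)}
    )
    return history


history = []
decision = route_prediction(["pull_usage_metrics", "draft_agenda"], confidence=0.62)
print(decision["mode"])

# The human approved a different order; log what actually happened.
record_outcome(history, "csm_42", "Prep QBR for account A",
               ["draft_agenda", "pull_usage_metrics"])
```

Note that the logged steps are the ones that actually ran, not the ones predicted. Logging predictions instead would teach the system to agree with itself.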

What to log, starting Monday

You do not need a research lab to start. A team can begin collecting workflow data with three columns:

User identifier (who did the work).
Request or trigger (what kicked it off, in free text).
Ordered list of action-steps with tool and arguments (what they actually did).

Two months of this data, even from a small team, gives you an eval set that is more valuable than any public benchmark, because it reflects your actual operations. When a vendor asks how you will measure them, hand them this.
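
The three columns map directly onto a small record type. The sketch below is one possible shape, written as JSON lines so it can live in a plain file; the class names ActionStep and WorkflowRecord and the to_json_line method are assumptions for illustration, not a schema defined by DRFLOW.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ActionStep:
    step: str                      # what was done, e.g. "pull_usage_metrics"
    tool: str                      # where it was done, e.g. "product_analytics"
    args: dict = field(default_factory=dict)


@dataclass
class WorkflowRecord:
    user_id: str                   # who did the work
    request: str                   # what kicked it off, in free text
    steps: list = field(default_factory=list)  # ordered ActionStep items

    def to_json_line(self):
        """Serialize the record as one line of JSON for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


record = WorkflowRecord(
    user_id="csm_27",
    request="Prep QBR for account A",
    steps=[
        ActionStep("pull_usage_metrics", "product_analytics",
                   {"account": "A", "window": "90d"}),
        ActionStep("pull_support_tickets", "zendesk",
                   {"account": "A", "window": "90d"}),
    ],
)

line = record.to_json_line()
print(line)
```

Keeping the step order inside a single record, rather than one row per step, makes it harder to lose the sequence when the data is exported or joined later.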

What the benchmark suggests about current systems

The DRFLOW paper reports that systems tuned for report generation do not transfer well to workflow prediction. They produce plausible-sounding steps that are wrong in order, wrong in tool choice, or generic to the role rather than specific to the user. Personalization gives meaningful lift only when the agent is allowed to attend to user history during prediction, not just at retrieval time.

For a buyer, this translates to a few questions to ask any vendor pitching a deep research agent for operational work:

Can you show step-level accuracy on a sequence of actions, not just text similarity on a final answer?
Does your system condition on per-user history, and can you show the lift from doing so?
How does the system behave when two users in our team do the same job differently? Does it converge them, or does it learn both styles?
What is the failure mode when confidence is low: silent guess, abstention, or escalation to a human?

None of these are about model size. They are about whether the product is built for the job an operator actually has.

Cost, risk, and the boring upside

The business case for getting workflow prediction right is not a moonshot number. It is the steady removal of small decisions. A CSM who does not have to remember which dashboard to open first saves a few minutes per QBR. Across a team of forty, across a quarter, that is real time. More importantly, the workflow is logged, which means onboarding a new CSM no longer depends on shadowing.

The risk side is also concrete. A wrong step in a workflow is auditable: you can see the agent picked tool X when it should have picked tool Y. Compare that to a hallucinated paragraph in a report, where the error is buried in prose that reads well. From a governance standpoint, step-level outputs are easier to review, easier to approve, and easier to roll back.

That is the under-discussed reason workflow prediction matters for AI governance. The format of the output, an ordered list of named actions, is itself a control surface. You can require approval on certain steps. You can block others entirely. You can route high-value workflows through a human. None of that is possible when the output is a wall of text.
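
Because the output is a list of named steps, those controls can be expressed as a lookup over step names. The sketch below is a minimal illustration: the step names in the two sets and the function apply_step_policy are hypothetical, and a real deployment would load the policy from wherever the team keeps its governance rules.

```python
# Hypothetical policy: step names are placeholders, not from DRFLOW.
BLOCKED_STEPS = frozenset({"delete_account"})
APPROVAL_STEPS = frozenset({"send_contract", "issue_refund"})


def apply_step_policy(steps, blocked=BLOCKED_STEPS, needs_approval=APPROVAL_STEPS):
    """Label each predicted step as 'block', 'approve', or 'allow'.

    Blocking takes precedence over approval if a step is in both sets.
    """
    decisions = []
    for name in steps:
        if name in blocked:
            decisions.append((name, "block"))
        elif name in needs_approval:
            decisions.append((name, "approve"))
        else:
            decisions.append((name, "allow"))
    return decisions


predicted = ["pull_usage_metrics", "send_contract", "delete_account"]
for name, decision in apply_step_policy(predicted):
    print(f"{name}: {decision}")
```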

DRFLOW: Benchmark for Personalized Workflow Prediction