
DRFLOW tests whether AI agents can predict the right action sequence for a specific user, not just write a good report. Here is what that means for buyers.
The benchmark itself is a research artifact. The pattern of logging workflows and scoring predictions against them is something you can adopt now, with the schema and scoring function described in this article. Most teams find their internal eval set is more useful than any public benchmark within a few weeks of logging.
Robotic process automation (RPA) executes a fixed script. Workflow prediction chooses which script to run, in what order, for which user, based on context. RPA is the hands; workflow prediction is the part that decides what the hands should do next.
Not necessarily. The DRFLOW results suggest that conditioning on user history matters more than raw model scale for this task. A smaller model with good per-user context can beat a larger model that sees only the current request.

That is the most common case and also the strongest argument for starting. The act of logging the first few hundred workflows surfaces the variation you have. You do not need a single canonical process; you need data that reflects how work is actually done.
An agentic org is one where agents take on standing roles, not just one-off prompts. Standing roles require predictable behavior, and predictable behavior requires evaluation against the work itself. Workflow prediction is the unit of evaluation that maps cleanly to a role: does the agent do the job the way the team does it? Without that, "autonomous" is a posture, not an operating state.
If you are buying or building agents for back-office work, the headline question is simple: can the agent figure out what to do next for this specific customer, ticket, or deal? A new benchmark called DRFLOW, introduced in a recent arXiv paper, argues that today's deep research systems are mostly graded on the wrong thing, and that the gap between "writes a good memo" and "predicts the right workflow" is larger than most buyers realize.
Deep research (DR) systems are agents that browse, read, and reason over many sources before answering. The well-known examples produce long reports with citations. DRFLOW asks a different question: given a user's history and a new request, can the agent predict the correct sequence of actions, the workflow, that this specific user would take?
A workflow here is concrete. It is an ordered list of action-steps: query a system, filter results, hand off to a teammate, draft a reply, log an outcome. The benchmark scores the agent on whether it predicts the right steps in the right order for the right person.
The shift matters because most enterprise tasks are not "write me a five-page brief on lithium supply." They are "for this renewal, what do we do next, in the order our team actually does it." Report quality is a poor proxy for that.

Two account managers handling the same renewal will run different playbooks. One pulls usage data first; the other opens the contract. Both are correct for their own pipeline. A benchmark that ignores personalization will reward a generic "best practice" sequence that no actual operator follows.
DRFLOW conditions on user history. The agent sees prior workflows from the same user, then has to predict the next one. That mirrors how a real assistant would learn the team: by watching, not by reading the SOP wiki.
Here is a side-by-side view of the two evaluation styles. If you are sitting through vendor demos, this is the table to bring.
| Dimension | Report-style DR | Workflow prediction (DRFLOW) | What it means for operators |
|---|---|---|---|
| Output | Prose with citations | Ordered list of action-steps | One is a deliverable; the other is execution |
| Primary metric | Text similarity, citation coverage | Step accuracy, sequence match | Step accuracy maps to "did the work get done right" |
| Personalization signal | Usually none | User history conditions the prediction | The agent learns your team, not a generic playbook |
| Failure mode | Hallucinated facts | Wrong step, wrong order, wrong tool | Wrong step is auditable; hallucinated prose often is not |
| Buying criterion | Quality of writing | Match rate against logged workflows | Use your own logs as the test set |
The practical implication: if you only score vendors on report quality, you will pick the most fluent writer, not the agent most likely to do the work the way your team does it.
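DRFLOW uses step-level and sequence-level scoring. In operator terms, that comes down to three questions: did the agent predict the right steps (step accuracy), did it put them in the right order (sequence match), and did seeing the user's history improve the prediction (personalization lift)?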
The last one is the interesting business metric. If a system shows no personalization lift, you are paying for a generic recommender wrapped in a chat box.
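Personalization lift can be made concrete as the gap between the average score with user history and the average score without it. The following is a minimal sketch, assuming you already have per-example scores from two runs of the same agent on the same examples. The function name `personalization_lift` and the numbers are hypothetical, not DRFLOW's published definition.

```python
from statistics import mean


def personalization_lift(scores_with_history, scores_without_history):
    """Average score gain from showing the agent the user's history.

    Both inputs are per-example scores (for instance, step accuracy)
    on the same evaluation examples, in the same order.
    """
    if len(scores_with_history) != len(scores_without_history):
        raise ValueError("score lists must cover the same examples")
    if not scores_with_history:
        return 0.0
    return mean(scores_with_history) - mean(scores_without_history)


# Illustrative numbers only, not results from the paper.
lift = personalization_lift([1.0, 0.5, 1.0], [0.5, 0.5, 0.5])
print(f"personalization lift: {lift:.2f}")
```

A lift near zero on your own logs is the signal described above: the system is behaving like a generic recommender.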
Consider a customer success team. The job-to-be-done: "prepare for the quarterly business review with account A." Different CSMs (customer success managers) execute this differently. Here is what a workflow prediction looks like in practice.
```yaml
# A predicted workflow for a CSM preparing a QBR.
# Each step is a concrete action the agent expects the user to take next.
user_id: csm_27
request: "Prep QBR for account A"
predicted_workflow:
  - step: pull_usage_metrics
    tool: product_analytics
    args: { account: A, window: 90d }
  - step: pull_support_tickets
    tool: zendesk
    args: { account: A, status: [open, resolved], window: 90d }
  - step: check_renewal_date
    tool: salesforce
    args: # ...
```
What makes this hard is that another CSM, csm_42, might draft the agenda first and pull metrics second. A good model conditioned on csm_42's history should flip the order. A bad model will predict the same sequence for everyone.
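To see what "flip the order" means in code, here is a deliberately simple baseline: predict each user's most frequent past workflow. It is a sketch of history conditioning under stated assumptions, not the method any DRFLOW system uses. The function `predict_from_history`, the step name `draft_agenda`, and the logged histories are all hypothetical.

```python
from collections import Counter


def predict_from_history(user_id, history, fallback):
    """Predict a user's next workflow as their most frequent past one.

    history: dict mapping user_id -> list of workflows, where each
    workflow is a list of step names. fallback is returned for users
    with no logged workflows.
    """
    past = history.get(user_id, [])
    if not past:
        return list(fallback)
    # Tuples are hashable, so whole workflows can be counted.
    counts = Counter(tuple(w) for w in past)
    most_common, _ = counts.most_common(1)[0]
    return list(most_common)


# Hypothetical logs: the two CSMs order the same steps differently.
history = {
    "csm_27": [["pull_usage_metrics", "draft_agenda"]] * 3,
    "csm_42": [["draft_agenda", "pull_usage_metrics"]] * 2,
}
generic = ["pull_usage_metrics", "draft_agenda"]

print(predict_from_history("csm_27", history, generic))
print(predict_from_history("csm_42", history, generic))
```

Even this baseline gives the two CSMs different predictions, which is the minimum bar a personalized agent should clear.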
You can score a prediction against a logged workflow with a small script. The point of showing it is that the metric is something an operations team can read.
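```python
# Compare a predicted workflow to the workflow the user actually ran.
# Returns step accuracy and a strict sequence match.
def score_workflow(predicted, actual):
    pred_steps = [s["step"] for s in predicted]
    true_steps = [s["step"] for s in actual]

    # Step accuracy: predicted steps found in the actual workflow,
    # divided by the number of actual steps.
    matched = sum(1 for s in pred_steps if s in true_steps)
    step_accuracy = matched / max(len(true_steps), 1)

    # Sequence match: longest common prefix, normalized.
    prefix = 0
    for p, t in zip(pred_steps, true_steps):
        if p != t:
            break
        prefix += 1
    # Assumed completion: normalize the prefix by the actual workflow's length.
    sequence_match = prefix / max(len(true_steps), 1)
    return {"step_accuracy": step_accuracy, "sequence_match": sequence_match}
```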
If you log your team's actual workflows, you already have an eval set. You do not need DRFLOW's data to run DRFLOW's idea.
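As one way to turn a log into an eval set, the sketch below measures how often a predictor reproduces a logged workflow exactly. It assumes each record carries a user, a request, and the ordered step names; the field names, the `exact_match_rate` helper, and the sample log are hypothetical.

```python
def exact_match_rate(records, predict):
    """Fraction of logged workflows the predictor reproduces exactly.

    records: list of dicts with "user_id", "request", and "steps"
    (the ordered step names the user actually ran).
    predict: callable (user_id, request) -> list of step names.
    """
    if not records:
        return 0.0
    hits = sum(
        1 for r in records
        if predict(r["user_id"], r["request"]) == r["steps"]
    )
    return hits / len(records)


# Hypothetical log and a deliberately generic predictor.
log = [
    {"user_id": "csm_27", "request": "Prep QBR for account A",
     "steps": ["pull_usage_metrics", "pull_support_tickets"]},
    {"user_id": "csm_42", "request": "Prep QBR for account B",
     "steps": ["pull_support_tickets", "pull_usage_metrics"]},
]


def generic_predictor(user_id, request):
    # Same sequence for everyone, regardless of who is asking.
    return ["pull_usage_metrics", "pull_support_tickets"]


print(exact_match_rate(log, generic_predictor))
```

The generic predictor matches only the user whose habits happen to coincide with it, which is the failure a per-user eval set is meant to expose.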
An agent operating model is the way a company organizes humans and agents around shared work. Workflow prediction sits in a specific slot in that model: the layer that decides what to do next, before the layer that executes a single tool call.
```mermaid
flowchart LR
    A[User request] --> B[Workflow predictor]
    B --> C{Confidence high?}
    C -- Yes --> D[Auto-execute steps]
    C -- No --> E[Suggest steps, human approves]
    D --> F[Log actual workflow]
    E --> F
    F --> G[Update user history]
    G --> B
```

The loop is the important part. Every executed workflow, whether the agent ran it or a human did, becomes a training and evaluation signal for the next prediction. That is what eval-driven operations means in practice: the system gets better because you are measuring the right thing on real work, not on a static benchmark.
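The confidence branch and the logging step in the flowchart can be sketched in a few lines. This assumes the predictor returns a confidence score between 0 and 1. The function names and the 0.9 threshold are placeholders to tune on your own logs, not values from the paper.

```python
def route_prediction(steps, confidence, threshold=0.9):
    """Decide whether predicted steps run automatically or need approval.

    High confidence auto-executes; anything else is suggested to a
    human. The default threshold is an arbitrary placeholder.
    """
    mode = "auto_execute" if confidence >= threshold else "suggest"
    return {"mode": mode, "steps": list(steps)}


def record_outcome(history, user_id, actual_steps):
    """Append the workflow that actually ran to the user's history."""
    history.setdefault(user_id, []).append(list(actual_steps))
    return history


history = {}
decision = route_prediction(["pull_usage_metrics"], confidence=0.72)
print(decision["mode"])

# Whatever ran, agent or human, is logged for the next prediction.
record_outcome(history, "csm_27", ["pull_usage_metrics", "check_renewal_date"])
```

Logging the actual workflow in both branches is what closes the loop: the suggestion path produces evaluation data too.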
You do not need a research lab to start. A team can begin collecting workflow data with three columns:
Two months of this data, even from a small team, gives you an eval set that is more valuable than any public benchmark, because it reflects your actual operations. When a vendor asks how you will measure them, hand them this.
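One plausible shape for such a log is sketched below. It assumes the three columns are the user, the request, and the ordered steps actually taken, which matches the QBR example but is an assumption rather than a schema from the paper. The `WorkflowRecord` class and `to_csv` helper are hypothetical names.

```python
import csv
import io
import json
from dataclasses import dataclass, field


@dataclass
class WorkflowRecord:
    """One logged workflow: who asked, what for, and what they did."""
    user_id: str
    request: str
    steps: list = field(default_factory=list)  # ordered step names


def to_csv(records):
    """Serialize records to CSV text, storing steps as a JSON list."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["user_id", "request", "steps"])
    for r in records:
        writer.writerow([r.user_id, r.request, json.dumps(r.steps)])
    return buffer.getvalue()


rows = [
    WorkflowRecord("csm_27", "Prep QBR for account A",
                   ["pull_usage_metrics", "pull_support_tickets"]),
]
print(to_csv(rows))
```

Storing the steps as a JSON list keeps their order intact inside a single spreadsheet-friendly column.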

The DRFLOW paper reports that systems tuned for report generation do not transfer well to workflow prediction. They produce plausible-sounding steps that are wrong in order, wrong in tool choice, or generic to the role rather than specific to the user. Personalization gives meaningful lift only when the agent is allowed to attend to user history during prediction, not just at retrieval time.
For a buyer, this translates to a few questions to ask any vendor pitching a deep research agent for operational work:
None of these are about model size. They are about whether the product is built for the job an operator actually has.
The business case for getting workflow prediction right is not a moonshot number. It is the steady removal of small decisions. A CSM who does not have to remember which dashboard to open first saves a few minutes per QBR. Across a team of forty, across a quarter, that is real time. More importantly, the workflow is logged, which means onboarding a new CSM no longer depends on shadowing.
The risk side is also concrete. A wrong step in a workflow is auditable: you can see the agent picked tool X when it should have picked tool Y. Compare that to a hallucinated paragraph in a report, where the error is buried in prose that reads well. From a governance standpoint, step-level outputs are easier to review, easier to approve, and easier to roll back.
That is the under-discussed reason workflow prediction matters for AI governance. The format of the output, an ordered list of named actions, is itself a control surface. You can require approval on certain steps. You can block others entirely. You can route high-value workflows through a human. None of that is possible when the output is a wall of text.
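Because the output is a list of named steps, a policy check can be a short function. The sketch below assumes the team maintains two sets of step names, one requiring approval and one blocked outright. The step names and the `apply_step_policy` helper are hypothetical.

```python
def apply_step_policy(steps, needs_approval, blocked):
    """Label each predicted step as run, hold for approval, or block.

    steps: ordered list of step names from a predicted workflow.
    needs_approval, blocked: sets of step names chosen by the team.
    Blocking takes precedence if a step appears in both sets.
    """
    decisions = []
    for step in steps:
        if step in blocked:
            decisions.append((step, "block"))
        elif step in needs_approval:
            decisions.append((step, "approve"))
        else:
            decisions.append((step, "run"))
    return decisions


# Hypothetical step names and policy.
predicted = ["pull_usage_metrics", "send_renewal_quote", "delete_account"]
policy = apply_step_policy(
    predicted,
    needs_approval={"send_renewal_quote"},
    blocked={"delete_account"},
)
for step, action in policy:
    print(f"{step}: {action}")
```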