Training a Deep Research Agent on 8,000 Synthetic Tasks: The Rubric Tree Trick
About this episode
An academic lab just matched OpenAI's Deep Research on several benchmarks using only 8,000 training examples — a recipe small enough to fit on one figure. The key move is a single data structure, the rubric tree, that unifies fact-seeking and report-writing into one training signal. We dig into how it works, what it actually unlocks, and the proprietary teacher stack quietly sitting underneath the word "fully synthetic."
What you'll take away
- Why the rubric tree works as one primitive for both obscure-fact tasks and open-ended report writing — and how it doubles as a synthesis target, a supervision filter, and an RL reward
- The context condenser as an epistemic state machine: trusted facts, contradicted claims, and open leads paired with the next action
- Why supervised fine-tuning actually hurts open-ended report quality, and how RL with GRPO recovers it
- The capped reward design that prevents the agent from gaming citation credit against task completion
- Why "fully synthetic" really means "no human annotation" — and why the open-weight model is effectively a distillation of an ensemble of closed frontier models
- A 2B-parameter model that beats o3 on GAIA and HLE fact-seeking, with the asterisk that it collapses on report writing
Chapters
- 00:00 What a deep research agent actually is
- 03:35 The rubric tree as a unifying primitive
- 09:31 The context condenser
- 11:40 Mid-training, SFT, and RL
- 15:34 Ablations and the alignment tax
- 19:27 What didn't work
- 23:21 The teacher dependency steelman
- 27:15 What the paper unlocks
References in this episode
- Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge — The Ohio State group's earlier hand-crafted precursor to QUEST, where the rubric
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — Introduces the GRPO algorithm whose relative-advantage scoring QUEST uses, and w
- Tongyi DeepResearch Technical Report — Alibaba's open deep research agent, which serves as QUEST's SFT teacher and is t
- GAIA: A Benchmark for General AI Assistants — The hard-reasoning agent benchmark the episode repeatedly cites when discussing
Full transcript
0:00Bella: Eight thousand training examples. That's it. A team at Ohio State, with collaborators at Amazon, took an open thirty-five-billion-parameter model, trained it on eight thousand fully synthetic research tasks, and ended up with an agent that goes head-to-head with OpenAI's Deep Research on a bunch of benchmarks — beating it on some, trading punches on others. For context: OpenAI hasn't told anyone what training data they used, how big it was, or how it was scored. So when an academic group lands roughly the same capability with a recipe small enough to fit on one figure of a paper, the question isn't whether the result holds. The question is: what's actually in those eight thousand examples?
0:46Tyler: The paper went up on arXiv on May twenty-second, twenty-twenty-six, and we're recording on May twenty-sixth — so four days later. Quick ground rules before we dig in. This episode is AI-generated. I'm Tyler, that's Bella — we're both AI voices from Eleven Labs, and the script is from Anthropic's Claude Opus 4.7. The producer isn't affiliated with Anthropic or Eleven Labs. The paper we're working through is "QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks," from Jian Xie and a long list of collaborators at Ohio State and Amazon's AGI lab in San Francisco. And the reason the size of that dataset matters — eight thousand, not eighty thousand, not eight hundred thousand — is that the recipe is small enough to be reproducible by an academic lab. Which is the whole point.
1:39Bella: Right. So let's set the stakes. When people say "deep research agent," they don't quite mean what most listeners picture when they hear "AI search." The mental model a lot of people have is retrieval-augmented generation — RAG. You ask a question, the system grabs some documents, stuffs them in the prompt, hands you an answer with maybe a citation. That's one-shot grounding. Useful, but it's basically asking a librarian for a cited paragraph. A deep research agent is a different animal. It runs an autonomous loop — sometimes hundreds of tool calls long. It decomposes the question, decides what to search for, reads pages, notices conflicts between sources, decides whether to dig deeper or pivot, takes notes, eventually writes a synthesized long-form report with citations. The analogy I keep coming back to: RAG is asking a librarian a question. Deep research is hiring a research assistant for an afternoon and getting back a memo.
2:43Tyler: And the catch the paper foregrounds is that "deep research" isn't actually one capability. It's three. You've got fact seeking — the obscure-needle-in-haystack stuff, like the benchmark question where you have to figure out which architect designed the house Salinger lived in. You've got citation grounding — every claim in your report has to link back to a verifiable source. And you've got report synthesis — producing a coherent, well-structured essay-length output that a human can actually read. The field has been treating these in isolation. Every benchmark tests one of them. Every open agent that exists is optimized for one of them. Tongyi DeepResearch — Alibaba's open agent — is great at fact-seeking, mediocre at reports. Other open systems are the reverse. So the open-weight world has a fragmented, lopsided picture, and meanwhile OpenAI and Google and Anthropic are running proprietary systems that handle all three. Bella, this is where the paper's central question lands: what would training data look like that handled all three at once?
3:56Bella: And here's the move. The dominant format for training an agent — the one Tongyi uses, the one most open recipes use — is "complex question, single verifiable answer." Find the obscure fact. Match the string. That format works for fact-seeking and you can synthesize it at scale, because checking is cheap. But it just doesn't represent most real research tasks. "Write me a report on Apple's foldable phone strategy" doesn't have one right answer. Neither does "compare the pricing strategies of these four streaming services." There's no string to match against. So the open community has been stuck — fluent at one kind of supervision, blind to the other. QUEST's central insight is a different format entirely. Instead of "right answer or wrong answer," you give the agent a hierarchical checklist of what counts as a good answer. They call it a rubric tree.
4:56Tyler: Unpack that for me. What does the tree actually look like?
5:01Bella: Think about a real college professor. A good one doesn't say "write a paper on the French Revolution and I'll grade it however I feel." They hand you a rubric on day one. Thesis statement, ten points. Uses at least three primary sources, fifteen. Addresses economic causes, ten. Addresses social causes, ten. And so on. The rubric is a tree — the root is your final grade, branches are major themes, leaves are individual checkable items. Some of those leaves are critical: if your paper isn't actually about the French Revolution, nothing else matters. That branch zeros out. That's the rubric tree, almost exactly. The QUEST pipeline generates one of these for every synthetic task. For a fact-seeking task — say, an outbreak investigation — the leaves are things like "names Boar's Head as Outbreak One" and "cites a source URL for that claim." For an open-ended report task — say, a comparison of Apple and Samsung's foldable strategies — the leaves are things like "addresses pricing strategy," "compares technical specs," "discusses market positioning." The leaves are binary checks. Internal nodes aggregate them. Critical leaves can zero out a whole branch. And here's why this is the unifying primitive — because it's the same data structure for both kinds of task. The fact-seeking tree and the open-ended tree look different at the leaves, but they're the same shape, scored by the same machinery. You can mix them in one training set. One framework, two capabilities, no fragmentation.
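Editor's sketch: the tree Bella describes can be written down in a few lines. This is a minimal illustration, not the paper's implementation. The `RubricNode` class, the unweighted-mean aggregation, and the pass/fail values are assumptions made for the example; the transcript only states that leaves are binary checks, internal nodes aggregate them, and a failed critical leaf zeros out its branch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree (names are illustrative)."""
    name: str
    passed: bool = False      # only meaningful for leaves
    critical: bool = False    # a critical child scoring zero zeros out its parent
    children: List["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        # Leaf: a binary check.
        if not self.children:
            return 1.0 if self.passed else 0.0
        child_scores = [child.score() for child in self.children]
        # Critical gate: a critical child that scores zero zeros this branch.
        for child, child_score in zip(self.children, child_scores):
            if child.critical and child_score == 0.0:
                return 0.0
        # Assumed aggregation: unweighted mean of the children.
        return sum(child_scores) / len(child_scores)


# Open-ended task: same structure, qualitative leaves.
report_tree = RubricNode("foldable strategy report", children=[
    RubricNode("is about Apple and Samsung foldables", passed=True, critical=True),
    RubricNode("coverage", children=[
        RubricNode("addresses pricing strategy", passed=True),
        RubricNode("compares technical specs", passed=True),
        RubricNode("discusses market positioning", passed=False),
    ]),
])

# Fact-seeking task: the critical leaf fails, so the citation leaf cannot rescue it.
fact_tree = RubricNode("outbreak investigation", children=[
    RubricNode("names Boar's Head as Outbreak One", passed=False, critical=True),
    RubricNode("cites a source URL for that claim", passed=True),
])

print(round(report_tree.score(), 3))
print(fact_tree.score())
```

Both trees go through the same `score` method, which is the point of the unification: only the leaves differ between task types.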
6:44Tyler: Okay, and the leverage that gives you is enormous, because the tree isn't just a way to evaluate a task — it's a way to synthesize one. Let me make sure I have this right. The pipeline doesn't start with a question. It starts with a topic — they pull trending keywords from Google Trends. Then they let Claude Sonnet 4.5 — and we'll come back to that choice — browse the web, extract verifiable constraints from real pages, and build the rubric tree. The question gets written from the tree, not the other way around. The tree is the ground truth. The question is a wrapper around it.
7:22Bella: Exactly. And the rubric becomes either an executable Python script — for objective tasks where you can literally run the checks — or a judge protocol for open-ended tasks. The filtering is brutal. They start with seventeen thousand candidate objective tasks and end with about six thousand after refinement, verification, and script-validity checks. The open-ended side starts with three thousand and lands at twenty-two hundred. The final eight-K dataset is what survived all of that. So now you've got training data, supervision filters — you can throw out trajectories that score badly on the rubric — and a reward signal for reinforcement learning that's fine-grained instead of binary. One primitive, three jobs. That's the move.
8:10Tyler: I want to flag the elegance and the cost of this in the same breath. The elegance is real — a rubric tree is genuinely the same primitive for "find the obscure fact" and "write a coherent essay," and that's the kind of conceptual unification that powers a lot of good machine learning work. The cost is that the rubric is generated by Claude Sonnet 4.5, the judge of the rubric is Claude Sonnet 4.5, the reference reports for open-ended evaluation are written by Claude Sonnet 4.5, the fact-checker downstream is GPT-5-mini, the eval-script generator is GPT-5. So the open thirty-five-billion-parameter model that comes out the other end is being trained, in effect, against an ensemble of frontier proprietary models acting as teacher and grader and arbiter. We'll come back to that. The paper's honest about it but it's worth holding in mind from the start. "Fully synthetic" in the title means "no human annotation." It does not mean "no proprietary dependency."
9:22Bella: That's the steelman thread, and we'll pick it up. For now let's stay on the architecture, because the second piece is just as elegant — and Tyler, you should take this one, because it's a systems problem at heart.
9:38Tyler: Yeah, the context condenser is my favorite engineering move in the paper. Here's the problem. A deep research session is long. Hundreds of tool calls. Each one accumulates raw HTML, search results, reasoning traces. You blow past any model's context window in minutes. The naive solutions are bad — you can throw out old context, in which case the agent forgets what it's already learned, or you can keep only the last few turns, in which case you forget the earliest part of the investigation. Either way, you've thrown away the agent's epistemic state — what it knows, what it doubts, what it still has to check. QUEST's solution is to compress that state into a structured object with three buckets. Trusted: facts it has verified against sources, with the URLs attached. Untrusted: claims it has now seen contradicted by other sources, flagged with the reason. Uncertain: partial claims paired with a specific next action — visit this URL, run that search query. The image I want listeners to have is a detective working a long case. They've got a corkboard with three columns of index cards. Confirmed, with the source pinned to the card. Contradicted, with a note about why. Open leads, paired with the exact next step. When the desk gets too cluttered, they don't throw cards away — they transcribe the corkboard to a clean wall and keep working. That's the condenser. The desk is the context window. The corkboard is the structured state. The condensation event is moving to a clean wall.
11:18Bella: And there's a line from the condenser prompt that I think captures the whole idea — the state has to encode what is already verified and final, what is false or contradicted, and what is missing along with the exact next action to resolve it. That last part is the key. An uncertainty isn't just "I don't know yet" — it's paired with the move that would resolve it. The agent's memory isn't just a record; it's a plan.
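Editor's sketch: the three-bucket state can be modelled as a small structured object. The class names, field names, and rendered text layout below are assumptions for illustration, and the example claims and URLs are made up; the transcript only specifies the three buckets and that each uncertain item carries a concrete next action.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrustedFact:
    claim: str
    source_urls: List[str]   # verified facts keep their sources attached


@dataclass
class UntrustedClaim:
    claim: str
    reason: str              # why the claim is considered contradicted


@dataclass
class UncertainLead:
    claim: str
    next_action: str         # the exact step that would resolve the uncertainty


@dataclass
class ResearchState:
    """Hypothetical condensed state that seeds the next session."""
    trusted: List[TrustedFact] = field(default_factory=list)
    untrusted: List[UntrustedClaim] = field(default_factory=list)
    uncertain: List[UncertainLead] = field(default_factory=list)

    def render(self) -> str:
        # Serialize the state into text that replaces the raw history.
        lines = ["TRUSTED:"]
        lines += [f"- {f.claim} [{', '.join(f.source_urls)}]" for f in self.trusted]
        lines.append("UNTRUSTED:")
        lines += [f"- {c.claim} (contradicted: {c.reason})" for c in self.untrusted]
        lines.append("UNCERTAIN:")
        lines += [f"- {u.claim} -> next: {u.next_action}" for u in self.uncertain]
        return "\n".join(lines)


# Made-up example content, for illustration only.
state = ResearchState(
    trusted=[TrustedFact("Outbreak One is linked to Boar's Head",
                         ["https://example.com/outbreak-report"])],
    untrusted=[UntrustedClaim("The recall began in March",
                              "a second source gives a different month")],
    uncertain=[UncertainLead("Number of affected states",
                             "search: outbreak case counts by state")],
)
print(state.render())
```

Because every uncertain entry requires a `next_action`, the rendered state reads as a plan as well as a record, which is the property Bella highlights.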
11:46Tyler: Right. And here's the bit that's clever about how this connects to training. They train the model on what they call "sessions" — chunks of trajectory between condensation events. The model learns to pick up from a compressed state and keep going, because that's exactly what it'll do at inference. The training-time artifact and the inference-time artifact are the same shape. No extrapolation. That's the kind of detail that, if you don't think about it, you don't notice — but if you've ever tried to train an agent on long trajectories, you know that the mismatch between how the model trains and how it later operates is where things quietly fall apart.
12:29Bella: Okay. So we have the rubric tree, we have the condenser. Let's pull the training pipeline together, because the three stages do different work and the ablations are where the surprises hide. Stage one is what they call mid-training. The base model gets pre-trained for two auxiliary skills: producing the structured state from a long history, and extracting goal-relevant content from raw HTML. Neither of those skills requires new annotation — the targets are reused from the condenser itself and from a tool cache they were already keeping. So it's almost free supervision that happens to teach the base model the intermediate artifacts it'll need at inference. Stage two is supervised fine-tuning. They collect trajectories using Tongyi DeepResearch as a teacher. The teacher takes a shot at each task; the rubric tree scores the result; only the successful trajectories survive. For tasks the teacher fails on, they take the fine-grained rubric feedback, inject it back as a hint, and let the teacher retry. That's an automatic self-correction loop powered by exactly the same primitive. Stage three is reinforcement learning. GRPO, the relative-advantage method DeepSeek-R1 popularized. And here's where the reward design earns its keep.
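Editor's sketch: the stage-two loop (teacher attempts, rubric filter, hinted retry) could look roughly like this. Everything named here is hypothetical: `run_teacher` and `score_with_rubric` stand in for the teacher agent and the rubric scorer, and the pass threshold, retry count, and hint wording are assumptions, since the transcript does not give them.

```python
from typing import Callable, List, Optional, Tuple

Trajectory = str
RubricResult = Tuple[float, List[str]]  # (score in [0, 1], names of failed leaves)


def collect_sft_trajectories(
    tasks: List[str],
    run_teacher: Callable[[str, Optional[str]], Trajectory],
    score_with_rubric: Callable[[str, Trajectory], RubricResult],
    pass_threshold: float = 1.0,   # assumed: keep only fully successful runs
    max_retries: int = 1,          # assumed: one hinted retry per task
) -> List[Tuple[str, Trajectory]]:
    kept = []
    for task in tasks:
        hint = None
        for _attempt in range(max_retries + 1):
            trajectory = run_teacher(task, hint)
            score, failed_leaves = score_with_rubric(task, trajectory)
            if score >= pass_threshold:
                kept.append((task, trajectory))
                break
            # Turn the fine-grained rubric feedback into a hint for the retry.
            hint = "Previous attempt missed: " + "; ".join(failed_leaves)
    return kept


# Stub teacher and rubric so the sketch runs without any model or API.
def stub_teacher(task: str, hint: Optional[str]) -> Trajectory:
    if task == "a":
        return "good"
    if task == "b":
        return "good" if hint else "bad"
    return "bad"


def stub_rubric(task: str, trajectory: Trajectory) -> RubricResult:
    return (1.0, []) if trajectory == "good" else (0.0, ["missing source URL"])


kept = collect_sft_trajectories(["a", "b", "c"], stub_teacher, stub_rubric)
print(kept)
```

In the stub run, task "a" passes on the first attempt, task "b" passes only after the hint, and task "c" is dropped.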
13:49Tyler: Walk us through the reward, because this is one of those places where the math is doing real conceptual work.
13:56Bella: It's a weighted combination of two scores. The rubric score — how well did you complete the task — and a fact-checking score — were your citations actually supported. Three quarters of the reward comes from the rubric score. The remaining quarter comes from the fact-checking score, but capped by the rubric score. Capped. That's the move. The intuition: citation credit is upper-bounded by how well you actually answered the question. If you produce beautifully cited gibberish — perfect citations but the report is incoherent — your citation bonus collapses to your gibberish-level rubric score. If you complete the task brilliantly but lie about your sources, your citation bonus drops to your low fact-checking score. You can't game one signal without paying for it in the other. Imagine a research foundation funding researchers: we pay you mostly for actually solving the problem. There's a bonus for good citations. But the bonus is capped by how well you solved the problem. You can't earn the citation bonus by writing a beautifully cited paper that doesn't answer the question. That's the formula in plain English, and the cap is the whole point — it closes off a degenerate optimum that pure additive rewards would leave open.
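Editor's sketch: the capped reward as described in the episode, with three quarters of the weight on the rubric score and one quarter on the fact-checking score capped by the rubric score. The function name and the example scores are assumptions, and the paper's exact formula may differ in detail from this paraphrase.

```python
def capped_reward(rubric: float, fact: float, rubric_weight: float = 0.75) -> float:
    """Reward as described in the episode; both inputs are assumed to lie in [0, 1]."""
    if not (0.0 <= rubric <= 1.0 and 0.0 <= fact <= 1.0):
        raise ValueError("scores must be in [0, 1]")
    # Citation credit can never exceed how well the task itself was completed.
    capped_fact = min(fact, rubric)
    return rubric_weight * rubric + (1.0 - rubric_weight) * capped_fact


def additive_reward(rubric: float, fact: float, rubric_weight: float = 0.75) -> float:
    """Uncapped baseline, shown only for contrast."""
    return rubric_weight * rubric + (1.0 - rubric_weight) * fact


# Illustrative scores, not numbers from the paper.
cited_gibberish = capped_reward(rubric=0.1, fact=1.0)
uncapped_gibberish = additive_reward(rubric=0.1, fact=1.0)
unsupported_but_good = capped_reward(rubric=0.9, fact=0.2)

print(round(cited_gibberish, 3), round(uncapped_gibberish, 3),
      round(unsupported_but_good, 3))
```

With these illustrative inputs, perfectly cited gibberish earns about 0.1 under the cap versus about 0.325 under the additive baseline, which is the degenerate optimum the cap closes off.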
15:18Tyler: And the GRPO piece, briefly, is what we talked about — instead of judging each attempt against an absolute reward scale, you judge it against the other attempts at the same prompt in the same batch. Graded on a curve against its siblings. This matters for deep research because the difficulty range is wild — some tasks are nearly impossible, some are easy. Absolute scoring would saturate on easy tasks and starve on hard ones. Relative scoring extracts a useful gradient from both.
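Editor's sketch: the "graded on a curve against its siblings" idea, shown as the group-relative advantage computation commonly used with GRPO (subtract the group mean, divide by the group standard deviation). The function name, the epsilon term, the use of the population standard deviation, and the reward values are assumptions; implementations vary on these details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against the other rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four rollouts on an easy prompt and a hard prompt.
easy_group = [0.85, 0.90, 0.95, 1.00]
hard_group = [0.00, 0.00, 0.05, 0.10]

easy_adv = group_relative_advantages(easy_group)
hard_adv = group_relative_advantages(hard_group)

print([round(a, 2) for a in easy_adv])
print([round(a, 2) for a in hard_adv])
```

Both groups produce advantages centred on zero with comparable spread, so the best rollout on a nearly impossible task gets a learning signal of similar size to the best rollout on an easy one.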
15:50Bella: Okay. So that's the recipe. Mid-training, SFT, RL. Now — what happens when you run the ablations? Because this is where I think the paper earns its keep.
16:00Tyler: The headline is counterintuitive. SFT alone — supervised fine-tuning on those teacher trajectories — actually hurts open-ended report quality. The model gets worse at writing reports. RL then recovers it dramatically. So you've got this U-shape where the middle stage is a regression and the final stage is a recovery. And the mechanism, if you sit with it, makes sense. SFT teaches imitation. The model is trying to mimic the teacher's trajectory at every step. For a long, multi-step output, that's a narrow target — there are many good reports, and forcing the model to match one specific report token by token is overfitting to surface form. RL, in contrast, doesn't care which report you write, as long as it scores well on the rubric. So RL lets the model find its own way to a good answer, where SFT pinned it to one path.
16:58Bella: And the related ablation is the alignment-tax finding. RL boosts open-ended benchmarks a lot. But it slightly reduces the hard-reasoning benchmarks — things like GAIA and HLE. Tyler, do you want to take this one, because I want to hear your read.
17:15Tyler: Yeah. The authors hypothesize this is the same phenomenon as the classical alignment tax from RLHF — you optimize the model for readable, well-structured output, and you narrow its distribution in a way that costs you a little raw reasoning power. They don't run the experiment that would actually test this — say, RL on the open-ended subset alone, versus RL on the objective subset alone, and see whether the tax comes from one or the other. So it remains a hypothesis. But it's an important one to surface. The whole paper's claim is that the three capabilities — fact seeking, citation grounding, report synthesis — go together. And the ablation evidence suggests that optimizing one of them trades off against another. That's not a fatal critique, but it's a real tension at the heart of the work.
18:11Bella: And then there's the small-model result, which is the one I keep coming back to. Their two-billion-parameter version — a model that fits on a laptop — scores about thirty on HLE and about seventy-three on GAIA. For comparison, OpenAI's o3 scores around twenty-five on HLE and seventy on GAIA. A two-billion-parameter open model outperforming o3 on hard-fact research tasks. That's a real product, not a research curiosity. If you're a hospital or a law firm that can't send queries to a hosted LLM for privacy reasons, you previously had no good option for an autonomous research agent. This says you might.
18:54Tyler: With one giant caveat. The two-B model is terrible at open-ended report writing. The capability that distinguishes a deep research agent from a glorified search tool in the paper's own framing — that capability collapses at small scale. So the headline "small models are strong deep research agents" is selectively true. They're strong fact-lookup agents. The "deep" part of "deep research" needs scale. I don't think the authors are hiding this — they show the numbers. But the framing is a little generous to the small models, and a careful reader should hold that asterisk.
19:33Bella: Fair. Let's pivot to the part of the paper I find genuinely refreshing — the unsuccessful attempts section. Tyler, this is your territory. They list things they tried that didn't work, in real detail. When was the last time you saw a paper do that?
19:50Tyler: It's rare, and it's the section I'd recommend reading even if you skim the rest. There are four failures they document, and one of them is the cleanest illustration of an LLM-judge problem I've seen written up. That one is pointwise scoring for open-ended evaluation. First attempt: ask an LLM judge to read a candidate report and give it a score from zero to one. Sounds reasonable. What happened? The judge handed out near-perfect scores about fifty percent of the time. They call this "high-score bias to favor the user." Without a reference to compare against, the judge has no anchor, and language models — especially helpful, RLHF-trained ones — drift generous. The analogy I find useful: course evaluations without an anchor. If you ask students to rate a course one to five with no comparison point, the ratings cluster around four-and-a-half regardless of course quality. Same phenomenon. Same fix.
20:51Bella: And the fix matters. Their second attempt was three-way scoring: show the judge the candidate report and a strong teacher's report, ask it to pick which is better, or call it a tie. That sounds robust — but the model they were training was weaker than the teacher, so it lost almost everywhere, and the score signal collapsed to nearly zero. Useless in the other direction. Their third try, the one that worked, was a relative numerical score against a reference. They give the judge both reports and ask for a score that captures the candidate's quality relative to the reference. If the candidate scores above one half, it beat the reference. If below, it lost. The signal is dense and unbiased and trainable. And notice — this is the same conceptual move as GRPO at training time. Anchor your judgment relative to a peer, not to an absolute scale. The paper essentially rediscovers, at evaluation time, the same trick it's using at training time. Which is satisfying in a way that good engineering often is.
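Editor's sketch: why three-way verdicts collapse while a relative numerical score stays informative when the candidate is consistently weaker than the reference. The verdict-to-reward mapping and all scores below are made up for illustration; only the 0.5 threshold for "beat the reference" comes from the episode.

```python
from statistics import pvariance

# Assumed mapping from a three-way verdict to a reward.
THREE_WAY_REWARD = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def beat_reference(relative_score: float) -> bool:
    """Per the episode: above one half means the candidate beat the reference."""
    return relative_score > 0.5


# Four candidate reports from a model weaker than the teacher (illustrative).
three_way_verdicts = ["loss", "loss", "loss", "loss"]
relative_scores = [0.22, 0.35, 0.41, 0.48]

three_way_rewards = [THREE_WAY_REWARD[v] for v in three_way_verdicts]

# Identical rewards carry no information about which report was better.
print(pvariance(three_way_rewards))
# Relative scores still rank the candidates even though all of them lost.
print(pvariance(relative_scores) > 0)
```

All four candidates lose under both schemes, but only the relative score still distinguishes the near miss at 0.48 from the weak attempt at 0.22.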
21:59Tyler: The other failures are quick. They tried direct preference optimization on pairs of reports. Training collapsed because reports differ along too many axes at once, and pairwise preferences over long-form outputs are just noisy. They tried adding a "predict what search results you'd get" objective in mid-training. Helped when isolated, hurt when combined with the context-summarization task because the two objectives overlapped. They tried rubric-based error identification — let the model spot its own rubric failures — but without web access during that stage, the model could only spot superficial errors. None of these are catastrophic, but laid out together they tell you something about how brittle this kind of training is. Most of the moves you might intuitively reach for don't work. The recipe that survives is the one that found the narrow path.
22:54Bella: All right. Let's land on the steelman. We've gestured at this — the teacher dependency. I want to give it a proper articulation, because I think it's the most important thing a thoughtful listener should walk away holding.
23:08Tyler: Here's the analogy I keep using for this. If you trained a chess player entirely on grandmaster games, with no human coach, no hand-written exercises, you could fairly call the training data "synthetic" in the sense that no human pedagogue wrote it. But your capability ceiling is set by the grandmasters whose games you used. QUEST's data is synthetic in the sense that no human wrote a rubric, no human wrote a report, no human scored anything. But the rubrics come from Claude Sonnet 4.5. The reference reports come from Claude Sonnet 4.5. The judge that grades the training signal is Claude Sonnet 4.5. The fact-checker is GPT-5-mini. The eval-script generator is GPT-5. The teacher whose trajectories provide the SFT data is Tongyi DeepResearch, which is itself a trained agent. So what the paper actually demonstrates, in the most honest framing, is that you can distill and combine the capabilities of an ensemble of frontier proprietary models into an open-weight model with a clever data pipeline. That's a real and important claim. But it's a different claim than "you can train a frontier deep research agent from scratch." The model weights are open. The data pipeline that produced them depends on closed APIs at multiple points.
24:34Bella: And the broader pattern this points to is uncomfortable. The open-source LLM ecosystem is increasingly dependent on proprietary models to generate its training signal. This is true across the board — synthetic data work, RL reward models, judge-based evaluations. The weights end up open. The recipe behind them, in practice, isn't really reproducible without API access to closed systems. QUEST isn't pretending otherwise — and Tyler, to the team's credit, the paper is unusually clear about its teacher stack. But the title's word "fully" is doing some work that I think a careful reader should push on. "Fully synthetic" reads as "free of proprietary dependency." It's not. It's free of human annotation.
25:23Tyler: One more concrete steelman point and then we can land. They include a manual audit of fifty rubric trees out of the eight thousand. They found errors in six of them. That's a twelve percent error rate in the audited sample. The authors describe this as "most generated scripts can accurately interpret task requirements" — which is technically true but soft-pedals an implied twelve-percent noise floor in the training signal. For SFT, that noise gets filtered — you only keep high-scoring trajectories. For RL, that noise becomes the reward signal you're optimizing against, and the paper doesn't analyze how RL behaves when the reward itself is noisy at that level. That's not a fatal flaw, but it's a real open question, and the kind of thing future work in this line will have to address.
26:16Bella: Okay. Pulling up. What does this paper actually unlock, and why should somebody who doesn't work on deep research agents care? Three things, I think. First, fully synthetic training data for agents that produce long-form outputs. Until now, you could synthesize fact-seeking questions, because the answer is checkable. You couldn't synthesize "write me a report" tasks at scale, because there's no clean grading signal without human-written references. The rubric tree breaks that bottleneck. And eight thousand is small enough that the next obvious move is eighty thousand, or eight hundred thousand. The capability is there; the data is the constraint that just got loosened. Second, a unified evaluation framework for agents. The whole field has been fragmented because different benchmarks reward different things, and you can't tell whether a given agent is good in general or just optimized for one benchmark shape. The rubric tree is the same primitive across task types. If this catches on as an evaluation standard, that's a substantive contribution to how the field measures progress.
27:29Tyler: And third, the structural argument. Eight thousand carefully-structured tasks beating dozens of approaches that used more data with weaker structure — that's an argument that, in agent training, the structure of supervision matters at least as much as the quantity. That's a claim with implications well beyond deep research agents. The contrast point in my head is the old "more data, more compute, more parameters" pattern of scaling. This paper is a small piece of evidence that on agent tasks, the structure of the supervision signal can be the bottleneck, not the volume. If you have the right format — the right primitive — eight thousand is enough.
28:14Bella: And the deployable-small-model story, even with the asterisk we put on it. A two-billion-parameter agent that does fact-seeking research at o3-level performance, runnable on a laptop, weights open. For a lot of organizations that can't send queries to hosted APIs — privacy reasons, regulatory reasons, infrastructure reasons — that's an actual capability that wasn't on the table a year ago.
28:41Tyler: Bella, what's your read on the durability of this? Because synthetic data pipelines have a habit of looking great in one paper and being hard to reproduce a year later.
28:52Bella: My honest read: the rubric tree concept is durable. Even if QUEST's specific recipe gets superseded — and it will — the idea of a hierarchical, auto-checkable evaluation primitive that doubles as a synthesis target and a reward signal is going to be picked up. Mind2Web 2, an earlier paper from the same Ohio State group, was the hand-crafted version. QUEST is the automated version. Somebody else will be the at-scale version. The primitive is what survives. The model weights themselves, I'm less confident about. The thirty-five-billion-parameter QUEST model is good now. It will be a footnote in a year. That's the pace of this field.
29:36Tyler: And the warning I'd leave listeners with: when the next paper in this line claims even more impressive numbers, look at the teacher stack. That's where the real capability is, and the more the proprietary teachers improve, the more the open distillates that ride on top will look like they're improving on their own. They're not, entirely. The ceiling is rising, but it's rising with the ceiling of the closed models that supervise the training. Which is fine. It's an honest fact about how the open ecosystem now works. But it's worth naming.
30:13Bella: Yeah. The honest version of "open agents are catching up" is "open agents are catching up by being downstream of closed agents that have a head start." That's still a useful catching-up. It's not the same as independent progress.
30:29Tyler: Good place to land. The contribution that I think will outlive the specific model is the rubric tree as a unifying primitive for both training and evaluation of agents that produce long-form outputs.
30:42Bella: The show notes have a link to the paper and some further reading if you want to keep pulling on this thread. And if you want the full transcript with the technical terms tagged and definitions inline, plus the cross-links to other episodes that touch the same ideas about agent training and synthetic data, that's all over at paperdive.ai.
31:06Tyler: Thanks for spending the time with us. This has been AI Papers: A Deep Dive.