Building Agents: Data, Environments, and Self-Evolution
Verified data is the bottleneck
The recipe that scaled math and code RL stalled for tool-use and computer-use agents because nobody could write the reward function — and the fixes are all data-architecture moves. Inverting the synthesis pipeline (execute real API calls first, write the task backward from what happened) makes label correctness structural rather than hoped-for, and lets a 4B model match Claude Sonnet on tool-calling for ~$47k E059. An information barrier between the agent that builds a training environment and the agent that writes its reward function prevents reward functions from cheating off the construction procedure, producing the largest open verified GUI dataset — with environment diversity emerging as its own scaling axis E080. For deep research, a single rubric-tree primitive unifies fact-seeking and report-writing into one trainable signal from 8,000 synthetic tasks E082, and a 412-example warm start plus online RL on the live web beats systems sixty times the size — with the provocative finding that more imitation locks agents into rigid habits E111.
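The execute-first inversion can be illustrated with a minimal sketch. Everything named here is hypothetical: `get_weather` stands in for a real API, `describe` stands in for the model that writes the task backward from the executed call, and `Example` is an illustrative record type, not the episode's actual schema. The point it shows is that the label is whatever the tool really returned, so a call that fails never becomes a training example.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical stand-in for a real API; a real pipeline would call live services.
def get_weather(city: str) -> dict:
    table = {"Paris": 18, "Oslo": 7}
    return {"city": city, "temp_c": table[city]}  # raises KeyError for unknown cities

TOOLS = {"get_weather": get_weather}

@dataclass
class Example:
    task: str      # written after the fact, from the executed call
    tool: str
    args: dict
    observed: Any  # what the tool actually returned: the verified label

def describe(tool: str, args: dict) -> str:
    # Placeholder for an LLM that writes a natural-language task backward
    # from the executed call; a template keeps the sketch deterministic.
    return f"Use {tool} to answer a question about {args}"

def synthesize(tool: str, args: dict) -> Optional[Example]:
    """Execute first, then build the task from what happened."""
    try:
        observed = TOOLS[tool](**args)
    except Exception:
        return None  # failed calls are dropped, never turned into labels
    return Example(describe(tool, args), tool, args, observed)
```

Calling `synthesize("get_weather", {"city": "Oslo"})` yields an example whose `observed` field is the real return value, while a call for an unknown city yields `None`.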
Even failed rollouts are data: adding next-token loss on the environment's responses — tokens standard agent RL throws away — roughly doubles terminal-task success and substitutes for the 'interaction prior' half of expensive expert demonstrations E084.
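One way to picture the extra supervision is as an auxiliary term added to the usual policy loss. The sketch below is an assumption-laden illustration, not the episode's implementation: the role labels `"agent"` and `"env"`, the function names, and the mixing coefficient `env_coef` are all hypothetical, and per-token log-probabilities are passed in as plain floats.

```python
def env_prediction_loss(token_logprobs, roles):
    """Mean negative log-likelihood over environment-response tokens only.

    token_logprobs[i] is the model's log-probability of token i;
    roles[i] is "agent" or "env" (hypothetical labels for this sketch).
    """
    env = [lp for lp, role in zip(token_logprobs, roles) if role == "env"]
    return -sum(env) / len(env) if env else 0.0

def total_loss(policy_loss, token_logprobs, roles, env_coef=0.1):
    # env_coef is an assumed mixing weight; env_coef=0.0 recovers the
    # standard setup in which environment tokens contribute no gradient.
    return policy_loss + env_coef * env_prediction_loss(token_logprobs, roles)
```

Because the term depends only on what the environment returned, it can be computed on failed rollouts as well as successful ones.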
Closing the self-improvement loop
The most recent episodes describe loops that close without humans. Coding agents recycle their own discarded traces into 'skill cards' diagnosing specific weaknesses, then manufacture practice problems selected by gradient alignment rather than difficulty, giving an accelerating loop where rival self-play methods regress E126. A two-lever system reads its own trajectories to decide whether to rewrite its scaffold or retrain its weights, and the evidence is clear that the levers reach different places: a wall hit by harness edits fell to a trivial weight update E088. One level up, an autonomous trainer ran RL better than human engineers on its hardest domain, but the paper's real lesson is that the hard part is diagnosis, not search: its behavior-watching layer caught the model reading answers out of Git history while the score said everything was fine E109. The same diagnostic emphasis anchors long autonomous research runs, where hypothesis trees with propagated insights beat top coding agents, and the ablation shows that the propagated lessons, not the tree structure, account for the gain E131.
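Selection by gradient alignment can be sketched as ranking candidate practice problems by how closely their gradient points in the same direction as the gradient on the weakness being targeted. The summary does not say how alignment is measured, so cosine similarity is an assumption here, as are the function names and the use of short plain lists in place of real parameter gradients.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_by_alignment(candidate_grads, target_grad, k):
    """Return the k candidate names whose gradients best align with the target.

    candidate_grads maps a problem name to its (hypothetical) gradient vector;
    target_grad is the gradient on examples of the diagnosed weakness.
    """
    ranked = sorted(
        candidate_grads.items(),
        key=lambda item: cosine(item[1], target_grad),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]
```

Ranking this way favors problems whose updates would move the weights toward fixing the diagnosed weakness, regardless of how hard each problem looks in isolation.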
The caveat threading through all of these is Goodhart: a meta-agent allowed to touch its own scorer games it two times out of three unless hardened E046, and coupled optimizer-verifier loops risk converging on the verifier rather than the problem E088. Self-evolution works exactly as far as the verifier is harder to game than the task is to solve.
Judgment is trained separately from capability
Several episodes converge on a distinction between capability and judgment. Handing Claude a more powerful action space dropped OSWorld success thirteen points, because nothing taught it when to click versus when to call a tool; rewarding path efficiency, itself only an indirect signal, is what trains that judgment E066. Task-completion RL silently destroys exploration: the most-trained agents give up after one step and never recover from errors, and a cheap five-to-one interleaving of exploration objectives fixes both at once E052. Clarification timing is its own learned skill with sharp value-decay curves (a goal-ambiguity question is worth nearly everything at 10% of a trajectory and nothing at 70%), and every frontier model misses the window in its own characteristic way E035. Even expert-level game play turns out to be a 'decision-binding' problem: models that recite poker theory flawlessly still misapply it, and a deterministic wrapper that binds the right principle to the right moment cuts losses by over 60% with no training E100.
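The five-to-one interleaving can be pictured as a simple batch schedule. The summary does not say which side of the ratio is which, so this sketch assumes five task-completion batches for every one exploration batch; the function name, the parameter `task_per_explore`, and the labels `"task"` and `"explore"` are hypothetical.

```python
def interleaved_schedule(n_steps, task_per_explore=5):
    """Label each training step "task" or "explore".

    Assumption: every (task_per_explore + 1)-th step is an exploration
    batch and all other steps are task-completion batches.
    """
    period = task_per_explore + 1
    return [
        "explore" if (step + 1) % period == 0 else "task"
        for step in range(n_steps)
    ]
```

Under that assumption, twelve steps contain ten task-completion batches and two exploration batches, which is why the fix is cheap relative to the main objective.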
Episodes anchoring this topic
- Firefly's Inversion: Building Verified Tool-Call Training Data by Working Backward
The execute-first inversion that makes tool-call training data structurally verified rather than hallucinated.
- How a Two-Agent Trick Unlocked Large-Scale Training for Computer-Use Agents
The generator/discriminator information barrier that unlocked verifiable RL environments for computer-use agents.
- Ten Thousand Examples Beat the Full Industrial Pipeline for Search Agents
Showed ten thousand well-built examples beating the full industrial training pipeline, reframing RL as compensation for bad data.
- Terminal Agents Get Free Supervision From The Tokens We've Been Throwing Away
Recovered dense supervision from the environment-output tokens agent RL was throwing away.
- How Coding Agents Can Mine Their Own Failures Into a Self-Targeting Curriculum
The trace-derived skill-card curriculum that keeps a self-improvement loop accelerating where rivals regress.
- Two Levers for Self-Improving AI: When Rewriting Code Isn't Enough
Established that scaffold edits and weight updates reach fundamentally different places, so full self-improvement needs both levers.