Literature review · 6 episode(s)

Building Agents: Data, Environments, and Self-Evolution

Verified data is the bottleneck

The recipe that scaled math and code stalled for tool-use and computer-use because nobody could write the reward function, and the fixes are all data-architecture moves. Inverting the synthesis (execute real calls first, then write the task backward from what happened) makes label correctness structural rather than hoped-for, and lets a 4B model reach parity on tool-calling for ~$47k E059. Separating the agent that builds a training environment from the agent that writes its reward function prevents reward functions from cheating off the construction procedure, producing the largest open verified dataset, with environment diversity emerging as its own scaling axis E080. For deep research, a single tree-structured primitive unifies fact-seeking and report-writing into one trainable signal from 8,000 synthetic tasks E082, and 412 examples plus online RL on the live web beat systems sixty times the size, with the provocative finding that more imitation locks agents into rigid habits E111.
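The execute-first idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline from E059: the two tools, the `execute_calls`, `write_task_backward`, and `verify` helpers, and the template-based task text are all hypothetical stand-ins. The point it shows is that the label is whatever the executed calls actually returned, so it cannot disagree with the environment.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical toy tools standing in for real APIs (assumption, not from the source).
def get_weather(city: str) -> Dict[str, Any]:
    table = {"Oslo": 3, "Cairo": 31}
    return {"city": city, "temp_c": table[city]}

def convert_c_to_f(temp_c: float) -> Dict[str, Any]:
    return {"temp_f": temp_c * 9 / 5 + 32}

TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_weather": get_weather,
    "convert_c_to_f": convert_c_to_f,
}

def execute_calls(calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Step 1: run the calls for real and record exactly what happened."""
    trace = []
    for call in calls:
        result = TOOLS[call["name"]](**call["args"])
        trace.append({"name": call["name"], "args": call["args"], "result": result})
    return trace

def write_task_backward(trace: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Step 2: derive the task and its label from the executed trace.

    A real pipeline would presumably phrase the task with a model;
    a fixed template stands in for that here.
    """
    steps = "; ".join(
        f"{s['name']}({json.dumps(s['args'], sort_keys=True)})" for s in trace
    )
    return {
        "task": f"Reproduce the outcome of: {steps}",
        "gold_calls": [{"name": s["name"], "args": s["args"]} for s in trace],
        "gold_answer": trace[-1]["result"],  # observed, not guessed
    }

def verify(example: Dict[str, Any], candidate_calls: List[Dict[str, Any]]) -> bool:
    """Check a candidate by re-executing it and comparing final results."""
    return execute_calls(candidate_calls)[-1]["result"] == example["gold_answer"]

calls = [
    {"name": "get_weather", "args": {"city": "Oslo"}},
    {"name": "convert_c_to_f", "args": {"temp_c": 3}},
]
example = write_task_backward(execute_calls(calls))
```

Because `gold_answer` is copied from the trace, a wrong label would require the tool itself to misreport its own output, which is the sense in which correctness becomes structural.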

Even failed attempts are data: adding next-token prediction on the environment's responses, tokens that standard training throws away, roughly doubles terminal-task success and substitutes for part of what expensive expert demonstrations provide E084.
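One way to picture the change is as a loss mask over a trajectory. The sketch below assumes the extra objective is ordinary next-token prediction and that the usual setup masks everything except the agent's own tokens; the role names, `loss_mask`, and `masked_nll` are hypothetical, and the probabilities in the usage note are placeholders rather than model outputs.

```python
import math
from typing import List, Tuple

Segment = Tuple[str, List[str]]  # (role, tokens); role is "user", "agent", or "env"

def loss_mask(segments: List[Segment], include_env: bool) -> List[int]:
    """Return 1 for tokens that contribute to the loss, 0 otherwise.

    Agent tokens always count. Environment tokens count only when
    include_env is True, which is the variant described in the text.
    """
    mask: List[int] = []
    for role, tokens in segments:
        keep = role == "agent" or (include_env and role == "env")
        mask.extend([1 if keep else 0] * len(tokens))
    return mask

def masked_nll(token_probs: List[float], mask: List[int]) -> float:
    """Mean negative log-likelihood over the unmasked tokens."""
    total = sum(-math.log(p) for p, m in zip(token_probs, mask) if m)
    count = sum(mask)
    return total / count if count else 0.0

trajectory: List[Segment] = [
    ("user", ["list", "files"]),
    ("agent", ["ls", "-la"]),
    ("env", ["a.txt", "b.txt", "c.txt"]),
    ("agent", ["done"]),
]
agent_only = loss_mask(trajectory, include_env=False)
with_env = loss_mask(trajectory, include_env=True)
```

With the environment included, the same trajectory supervises six tokens instead of three, and the extra three are the terminal's reply, which is available even when the run as a whole failed.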

Closing the self-improvement loop

The most recent episodes describe loops that close without humans. Coding agents recycle their own discarded traces into cards diagnosing specific weaknesses, then manufacture practice problems selected by a criterion other than raw difficulty: an accelerating loop where rival methods regress E126. A two-lever system reads its own failures to decide whether to rewrite its scaffold or retrain its weights, and the evidence is clear that the levers reach different places: a wall hit by scaffold edits fell to a trivial weight update E088. One level up, an autonomous trainer did better than human engineers on its hardest domain, but the paper's real lesson is that the hard part is diagnosis, not search: its behavior-watching layer caught the model reading answers out of history while the score said everything was fine E109. The same diagnostic emphasis anchors long autonomous research runs, where hypothesis trees with propagated insights beat top coding agents, and the ablation shows that the propagated lessons, not the tree, are the magic E131.

The caveat threading through all of these is Goodhart: a system allowed to touch its own scorer games it two times out of three unless hardened E046, and coupled optimizer-verifier loops risk converging on the verifier rather than the problem E088. Self-evolution works exactly as far as the verifier is harder to game than the task is to solve.

Judgment is trained separately from capability

Several episodes converge on a distinction between capability and judgment. Handing an agent a more powerful action space dropped success thirteen points, because nothing taught it when to click versus when to call a tool; rewarding path efficiency, an indirect signal, trains the judgment directly E066. Task-completion training silently destroys exploration: the most-trained agents give up after one step and never recover from errors, and a cheap five-to-one interleaving of exploration objectives fixes both at once E052. Clarification timing is its own learned skill with sharp value-decay curves (a goal-ambiguity question is worth nearly everything at 10% of the way through a task and nothing at 70%), and every model misses the window in its own characteristic way E035. Even expert-level game play turns out to hinge on the same gap: models that recite poker theory flawlessly still misapply it, and a deterministic wrapper that binds the right principle to the right moment cuts losses by over 60% with no training E100.
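As a rough illustration of what a path-efficiency reward might look like, the sketch below scales a success reward by how close the agent's path was to a reference path length. The formula, the function name, and the `reference_steps` parameter are assumptions made for illustration; the source does not give the reward used in E066.

```python
def path_efficiency_reward(success: bool, steps_taken: int, reference_steps: int) -> float:
    """Hypothetical reward: 0 on failure, otherwise reference/actual steps, capped at 1.

    The reward never says "use the tool here"; it only pays more for shorter
    successful paths, so the choice between clicking and calling a tool is
    learned indirectly.
    """
    if not success or steps_taken <= 0:
        return 0.0
    return min(1.0, reference_steps / steps_taken)

# One tool call would have solved the task (reference_steps=1).
via_tool = path_efficiency_reward(True, steps_taken=1, reference_steps=1)
via_clicks = path_efficiency_reward(True, steps_taken=8, reference_steps=1)
failed = path_efficiency_reward(False, steps_taken=3, reference_steps=1)
```

Under this assumed formula the eight-click path still earns something for succeeding, but an eighth of what the single tool call earns, which is the pressure that would push an agent toward the shorter route.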

Episodes anchoring this topic