Topic · 20 episodes across 6 reviews
Evaluating, Serving, and Deploying Agents at Scale
Three papers tackled the infrastructure behind real-world agents: a brutal GDP-grounded benchmark for professional software work, an OS-style scheduler for agent serving, and a security pipeline in which the LLM builds the test rig rather than acting as the judge.