Topic · 20 episodes across 6 reviews

Evaluating, Serving, and Deploying Agents at Scale

Three papers tackled the infrastructure of real agents: a brutal GDP-grounded benchmark for professional software work, an OS-style scheduler for agent serving, and a security pipeline in which the LLM builds the test rig rather than acting as the judge.

Covered in these reviews