Concept · 12 episodes

Agent Benchmarks

Definition

Agent benchmarks measure how well AI systems perform multi-step, tool-using tasks (navigating a browser, fixing a bug across a repository, completing a research task) rather than answering a one-shot question. They typically score end-to-end task completion, and their results are highly sensitive to scaffolding choices, meaning the prompts, tool definitions, and control loop wrapped around the model.
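To make the scoring idea concrete, here is a minimal sketch of an end-to-end success-rate calculation, assuming each run is recorded as a (scaffold, task, completed) tuple. The function name, scaffold labels, task names, and outcomes are all hypothetical and illustrative; they do not come from any benchmark or episode listed here.

```python
from collections import defaultdict

def success_rate_by_scaffold(runs):
    """Return {scaffold: fraction of tasks completed end to end}.

    `runs` is an iterable of (scaffold, task_id, completed) tuples.
    A task counts only if it was fully completed; no partial credit.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for scaffold, _task_id, completed in runs:
        totals[scaffold] += 1
        passed[scaffold] += int(completed)
    return {s: passed[s] / totals[s] for s in totals}

# Hypothetical data: the same model on the same three tasks,
# wrapped in two different scaffolds.
runs = [
    ("plain_prompt", "fix_bug", False),
    ("plain_prompt", "browse_form", True),
    ("plain_prompt", "research_summary", False),
    ("with_retry", "fix_bug", True),
    ("with_retry", "browse_form", True),
    ("with_retry", "research_summary", False),
]

rates = success_rate_by_scaffold(runs)
for scaffold, rate in rates.items():
    print(f"{scaffold}: {rate:.2f}")
```

In this made-up data the model is identical across both groups, so any gap in the reported rates comes from the scaffold alone. That is why a benchmark score is hard to interpret without knowing the scaffold it was obtained with.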

Episodes covering this

  1. Episode 078
    Training a Markdown File: When LLM Self-Improvement Borrows the Discipline of Neural Net Training
    Yang, Gong, Huang et al. · Microsoft · 28 min · May 25, 2026
  2. Episode 076
    Same Model, Organized Differently: How an Agent Architecture Beat Frontier Systems at Research Math
    Zhao, Yuan, Choi et al. · Georgia Institute of Technology · 22 min · May 25, 2026
  3. Episode 071
    When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface
    Xu, Wen, Li · Peking University · 23 min · May 22, 2026
  4. Episode 066
    Why Giving an AI Agent More Tools Can Make It Worse at Using a Computer
    Hu, Zhang, Xu et al. · Tongyi Lab · 26 min · May 22, 2026
  5. Episode 060
    When Splitting One Model Across Three Agents Doubles Its Accuracy
    Lu, Fang, Zhong et al. · University of Georgia · 26 min · May 20, 2026
  6. Episode 059
    Firefly's Inversion: Building Verified Tool-Call Training Data by Working Backward
    Lu, Wang, Lu et al. · Northeastern University · 22 min · May 20, 2026
  7. Episode 057
    How Uber Caught 206 Leaked Credentials With an LLM-Powered Security Stack
    Li, Hu, Xu et al. · Uber Technologies · 28 min · May 19, 2026
  8. Episode 052
    An Old Reinforcement Learning Tradeoff Sneaks Back Into LLM Agents
    Ye, Shi, Liu et al. · University of Science and Technology of China / Meituan · 23 min · May 18, 2026
  9. Episode 035
    Why Frontier Agents Ask for Clarification at Exactly the Wrong Moment
    Gulati, Gupta, Lumer et al. · PricewaterhouseCoopers U.S. · 29 min · May 11, 2026
  10. Episode 017
    When the Agent Grades Its Own Homework: A Brutal New Benchmark for AI Workers
    Aggarwal, Neubig, Welleck · CMU · 31 min · May 03, 2026
  11. Episode 013
    Why Search Keeps Rediscovering the Same Workflow, and What That Means
    Du, Liu, Du et al. · Carnegie Mellon University · 22 min · May 03, 2026
  12. Episode 001
    When AI Models Quietly Protect Each Other From Shutdown
    Potter, Crispino, Siu et al. · University of California · 25 min · May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.