Concept · 38 episodes

Agent Benchmarks

Definition

Agent benchmarks measure how well AI systems perform multi-step, tool-using tasks (navigating a browser, fixing a bug across a repository, completing a research task) rather than answering a one-shot question. They typically score end-to-end task completion, and their results are notoriously sensitive to scaffolding: the same model can score very differently depending on the prompts, tools, and harness wrapped around it.

Episodes covering this

202
How Do You Know an AI Agent Actually Refused? Check the World, Not the Words
Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
Feng, Lin, Wen et al. · AntGroup / Hunan Institute of Advanced Technology·18 min·Jul 06, 2026
195
Why 'Be Careful' Does Nothing for AI Coding Agents, and What Does
Coding Agents Are Guessing: Measuring Action-Boundary Violations in Underspecified DevOps Instructions
Ji, Zhang, Xu et al. · Hong Kong University of Science and Technology·15 min·Jul 03, 2026
191
How One Researcher Beat GPT-5.2 and Gemini 3 by Judging Their Answers, Not Improving Them
Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2
Land · Independent Researcher·26 min·Jul 02, 2026
190
The Skill Every AI Manager Is Missing: Handing Out Exactly the Right Keys
ClawArena-Team: Benchmarking Subagent Orchestration and Dynamic Workflows in Language-Model Agents
Xiong, Ji, Qiu et al. · UNC Chapel Hill·21 min·Jul 02, 2026
189
Why Phone Agents Ace the Test and Crash on Your Actual Phone
Xiaomi-GUI-0 Technical Report
Team, Qu, Luan · Xiaomi·24 min·Jul 02, 2026
186
How a Frozen Model Went From 2% to 77% on Physics Puzzles — Without Retraining
Hierarchical Experimentalist Agents
Chandra, Vaidyanathan, Dhanuka et al. · University of Massachusetts Amherst·22 min·Jun 30, 2026
168
When Turning Experience Into Code Makes Your AI Agent Dumber
Metis: Bridging Text and Code Memory for Self-Evolving Agents
Dai, He, Li et al. · The Chinese University of Hong Kong·27 min·Jun 24, 2026
165
A Free-Lunch Tweak That Lets a Tiny Agent Beat Frontier Giants
Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning
Wang, Song, Zhang et al. · Peking University·22 min·Jun 23, 2026
164
The Summarizer That Quietly Deletes Your Agent's Safety Rules
Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents
Chen · Beijing Institute of Technology·28 min·Jun 23, 2026
157
When an AI Coding Agent Drives a Phone Through the Terminal, No Screen Needed
Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?
Gu, Jiang, Guo et al. · Mila–Québec AI Institute / Concordia University·24 min·Jun 19, 2026
156
Why More Human Demonstrations Made a Computer-Use Agent Worse
ProCUA-SFT Technical Report
Jung, Lu, Cui et al. · NVIDIA / University of Washington·20 min·Jun 18, 2026
155
Why a Flawless Demo Makes a Worse Computer-Using Agent, And the Fix
Skill-Guided Continuation Distillation for GUI Agents
Fan, Yu, Shen et al. · StepFun·22 min·Jun 18, 2026
150
Don't Kill the Loser: A Different Way to Handle Two AI Agents Colliding
CoAgent: Concurrency Control for Multi-Agent Systems
Lyu, Zhang, Wu et al. · Shanghai Jiao Tong University·32 min·Jun 16, 2026
132
The Agent Failed — But Did the Instructions Deserve to Be Followed?
SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement
Gautam, Radhakrishna, Gulwani · Microsoft·30 min·Jun 11, 2026
125
AI Coding Agents Run a Marathon, and Fewer Than One in Three Finish
SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?
Desai, Hu, Cabezas et al. · Abundant·27 min·Jun 09, 2026
124
A Cheap Model With the Blueprints Beats Expensive Models Working Blind
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
Zhong, Segal, Bercovich et al. · Carnegie Mellon University·27 min·Jun 09, 2026
123
Five Identical Worlds, One Swapped Model: What Happens When AI Agents Run for Fifteen Days
Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy
Akkil, Kokku, Vikram et al. · Emergence AI·30 min·Jun 09, 2026
121
When the Agent Says It's Done But Nothing Happened: Debugging the Harness, Not the Model
From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws
Chen, Wang, Liu et al. · Institute of Software·27 min·Jun 05, 2026
120
How an AI Agent Rewrites Its Own Tools, Without an Answer Key
Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
Pan, Liu, Lin et al. · City University of Hong Kong·30 min·Jun 05, 2026
119
Beating Reinforcement Learning Without Ever Touching the Model's Weights
Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents
Hwang, Suri, Villecroze et al. · Layer6 AI·22 min·Jun 05, 2026
117
How an Open AI System Verified 672 Hard Math Proofs for Under $300
Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement
Chung, Cai, Li et al. · Princeton University·26 min·Jun 05, 2026
115
Teaching a Phone Agent to Reason Silently, And Keeping It Honest
MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
Yang, Hu, Hao et al. · Beihang University·24 min·Jun 04, 2026
112
When an AI Agent Cheats Without Being Told: Inside the Meta-Agent Challenge
The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?
Lu, Wang, Wang et al. · Institute of Software·22 min·Jun 04, 2026
105
The Trojan Is Your Agent's Memory: Why Single-Step Defenses Miss Persistent Attacks
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors
Tan, Dou, Yang et al. · Gaoling School of Artificial Intelligence·26 min·Jun 01, 2026
092
When Search Agents Don't Really Search: The Memory Shortcut Hiding in Browsing Benchmarks
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
Fan, Wang, Chu et al. · Harbin Institute of Technology·27 min·May 28, 2026
090
How MiniMax-M2 Bets That Sparsity Plus Verifiable Rewards Can Match Frontier Agents
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
MiniMax · MiniMax·28 min·May 27, 2026
078
Training a Markdown File: When LLM Self-Improvement Borrows the Discipline of Neural Net Training
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Yang, Gong, Huang et al. · Microsoft·28 min·May 25, 2026
076
Same Model, Organized Differently: How an Agent Architecture Beat Frontier Systems at Research Math
RMA: an Agentic System for Research-Level Mathematical Problems
Zhao, Yuan, Choi et al. · Georgia Institute of Technology·22 min·May 25, 2026
071
When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
Xu, Wen, Li · Peking University·23 min·May 22, 2026
066
Why Giving an AI Agent More Tools Can Make It Worse at Using a Computer
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
Hu, Zhang, Xu et al. · Tongyi Lab·26 min·May 22, 2026
060
When Splitting One Model Across Three Agents Doubles Its Accuracy
NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning
Lu, Fang, Zhong et al. · University of Georgia·26 min·May 20, 2026
059
Firefly's Inversion: Building Verified Tool-Call Training Data by Working Backward
Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
Lu, Wang, Lu et al. · Northeastern University·22 min·May 20, 2026
057
How Uber Caught 206 Leaked Credentials With an LLM-Powered Security Stack
ADR: An Agentic Detection System for Enterprise Agentic AI Security
Li, Hu, Xu et al. · Uber Technologies·28 min·May 19, 2026
052
An Old Reinforcement Learning Tradeoff Sneaks Back Into LLM Agents
Look Before You Leap: Autonomous Exploration for LLM Agents
Ye, Shi, Liu et al. · University of Science and Technology of China / Meituan·23 min·May 18, 2026
035
Why Frontier Agents Ask for Clarification at Exactly the Wrong Moment
Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
Gulati, Gupta, Lumer et al. · PricewaterhouseCoopers U.S.·29 min·May 11, 2026
017
When the Agent Grades Its Own Homework: A Brutal New Benchmark for AI Workers
Gym-Anything: Turn any Software into an Agent Environment
Aggarwal, Neubig, Welleck · CMU·31 min·May 03, 2026
013
Why Search Keeps Rediscovering the Same Workflow, and What That Means
Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors
Du, Liu, Du et al. · Carnegie Mellon University·22 min·May 03, 2026
001
When AI Models Quietly Protect Each Other From Shutdown
Peer-Preservation in Frontier Models
Potter, Crispino, Siu et al. · University of California·25 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.