Theme · 100 episodes

Evaluation & Benchmarks

Definition

Evaluation and benchmarking is the discipline of measuring AI capabilities and behaviors in a way that is comparable across models and over time. Good benchmarks are hard to build: they need to be challenging, well-validated, hard to game, and slow to saturate.

Episodes covering this

209
How 2.6 Billion Doodles Exposed the Culture Words Quietly Delete
Billions of Sketches Reveal Hidden Cultural Variation in Human Concepts
15 min·Jul 09, 2026
207
An AI Graded Its Own Math Test 94 Percent — It Actually Scored 20
More Convincing, Not More Correct: Self-Play Reward Hacking of Reference-Free LLM Judges
12 min·Jul 08, 2026
206
How Four-Second Clips Become Hours of Playable AI Soccer
Multiplayer Interactive World Models with Representation Autoencoders
15 min·Jul 07, 2026
205
The Same AI, Two Labels: How the Pitch Beat the Product in 162 Sessions
Rating the Pitch, Not the Product: User Evaluations of LLMs Reflect Expectations More Than Performance
13 min·Jul 07, 2026
202
How Do You Know an AI Agent Actually Refused? Check the World, Not the Words
Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
Feng, Lin, Wen et al. · Ant Group / Hunan Institute of Advanced Technology·18 min·Jul 06, 2026
201
One in Four NeurIPS Papers Cites a Reference That Doesn't Exist
Phantom References: Hallucinated Citations That Survive Peer Review at Top-Tier Conferences
Russinovich, Kumar, Salem · Microsoft·19 min·Jul 06, 2026
198
The Model That Knows the Answer and Can't Say It
Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale
Gollapudi, Gupta, Singhal et al. · UC Berkeley·17 min·Jul 03, 2026
197
Twin Problems Suggest AI Reasoning Gains Are Mostly Better Fact Recall
IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs
Abdaljalil, Serpedin, Kurban · Texas A&M University·17 min·Jul 03, 2026
196
AI Agents Reached Opposite Conclusions From the Same Data — and Passed Review
The Agentic Garden of Forking Paths
Miao, Pritchard, Zou · Stanford University·18 min·Jul 03, 2026
195
Why 'Be Careful' Does Nothing for AI Coding Agents, and What Does
Coding Agents Are Guessing: Measuring Action-Boundary Violations in Underspecified DevOps Instructions
Ji, Zhang, Xu et al. · Hong Kong University of Science and Technology·15 min·Jul 03, 2026
191
How One Researcher Beat GPT-5.2 and Gemini 3 by Judging Their Answers, Not Improving Them
Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2
Land · Independent Researcher·26 min·Jul 02, 2026
190
The Skill Every AI Manager Is Missing: Handing Out Exactly the Right Keys
ClawArena-Team: Benchmarking Subagent Orchestration and Dynamic Workflows in Language-Model Agents
Xiong, Ji, Qiu et al. · UNC Chapel Hill·21 min·Jul 02, 2026
189
Why Phone Agents Ace the Test and Crash on Your Actual Phone
Xiaomi-GUI-0 Technical Report
Team, Qu, Luan · Xiaomi·24 min·Jul 02, 2026
188
A Coding Agent Found a Hole in a Peer-Reviewed STOC Proof for Five Dollars
Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics
Moakhar, Gholami, Springer et al. · University of Maryland·20 min·Jul 02, 2026
187
An 8-Billion Agent That Beats Models 80 Times Its Size By Looking Things Up
An AI agent for treatment reasoning over a biomedical tool universe
Gao, Noori, Zhu et al. · Department of Biomedical Informatics·19 min·Jun 30, 2026
185
Aligned to Refuse, Built to Tap: When Phone Agents Know the Task Is a Crime and Do It Anyway
It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents
Sun, Chen, Zhou et al. · Fudan University·27 min·Jun 30, 2026
182
How a Tiny Model Too Weak to Plan Cuts a Bigger Agent's Hallucinations by 80%
Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents
Song, Cai · Emory University·17 min·Jun 29, 2026
181
How to Backpropagate Blame Through a Team of Chatbots — And When It Backfires
GBC: Gradient-Based Connections for Optimizing Multi-Agent Systems
Yang, Alrabah, Hakkani-Tür et al. · University of Illinois Urbana-Champaign·20 min·Jun 29, 2026
180
The Bug Where Smart Assistants Read a Fact and Still Forget It
Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents
Patel · Vrin·24 min·Jun 29, 2026
178
How an AI Reviewer Learned to Stop Going Easy on AI Writing
The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators
Iacob, Jovanović, Shen et al. · University of Cambridge·23 min·Jun 26, 2026
176
An AI Designed Its Own Psychology Studies, Then Confirmed What It Found
Closing the Loop to Discover Psychological Theories with an Automated Cognitive Scientist
Jagadish, Strittmatter, Jacoby et al. · Princeton University·31 min·Jun 26, 2026
173
The Free Step-Level Grader Hiding in Every RL Training Run
Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents
Oh, Li, Park et al. · University of Wisconsin–Madison·22 min·Jun 25, 2026
172
One Bad Token Can Sink a Model's Math, And You Can Delete It
Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning
Ko, Kang, Lee · Seoul National University·22 min·Jun 25, 2026
170
When a One-Liner Beats Your Agent's Clever Verification Logic
Bayesian control for coding agents
Papamarkou, Smirnov, Mazanov et al. · PolyShape / National Technical University of Athens·26 min·Jun 24, 2026
169
Why Better Bug Reports Can Make AI Coding Agents Worse
SHERLOC: Structured Diagnostic Localization for Code Repair Agents
Tamoyan, Narenthiran, Arakelyan et al. · NVIDIA / TU Darmstadt·24 min·Jun 24, 2026
166
A Router That Beats the Frontier Models It Calls
Sakana Fugu Technical Report
Tang, Cetin, Xu et al. · Sakana AI·26 min·Jun 23, 2026
157
When an AI Coding Agent Drives a Phone Through the Terminal, No Screen Needed
Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?
Gu, Jiang, Guo et al. · Mila–Québec AI Institute / Concordia University·24 min·Jun 19, 2026
156
Why More Human Demonstrations Made a Computer-Use Agent Worse
ProCUA-SFT Technical Report
Jung, Lu, Cui et al. · NVIDIA / University of Washington·20 min·Jun 18, 2026
151
Why More Experience Made This AI Agent Worse, And How to Fix It
Not All Skills Help: Measuring and Repairing Agent Knowledge
Wang, Zhou, Liang et al. · UNC Chapel Hill·28 min·Jun 16, 2026
145
Building Forgetting Into a Language Model With One Extra Line of Code
Natively Unlearnable Large Language Models
Ghosal, Maini, Raghunathan · Carnegie Mellon University·22 min·Jun 15, 2026
144
When an AI Agent Just Copies Its Tool — And Bigger Models Copy More
When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More
Wang, Vemuri · raptorX.ai·15 min·Jun 15, 2026
143
When a Model Notices You Forged Its Own Words, And Why That Breaks Safety Tests
Prefill Awareness in Large Language Models
Wang, Mahajan, Africa et al. · Constellation / University of Wisconsin-Madison·24 min·Jun 12, 2026
133
How MiniMax Turned a Reward-Hacking Disaster Into Olympiad Gold
MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
Chen, Zhang, Zhang et al. · MiniMax / The Chinese University of Hong Kong·34 min·Jun 12, 2026
132
The Agent Failed — But Did the Instructions Deserve to Be Followed?
SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement
Gautam, Radhakrishna, Gulwani · Microsoft·30 min·Jun 11, 2026
131
Why Autonomous Research Agents Forget Their Own Lessons, and Arbor's Fix
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
Jin, Hu, Qiu et al. · Renmin University of China·33 min·Jun 11, 2026
129
How a Crowd of Anonymous AI Agents Broke a 40-Year Math Record
Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries
Bianchi, Kwon, Pappu et al. · Together AI·29 min·Jun 11, 2026
127
What Diffusion Language Models Were Missing: A Map, Not an Algorithm
TextLDM: Language Modeling with Continuous Latent Diffusion
Jiang, Ren, Li et al. · JoyFuture Academy / HIT·30 min·Jun 11, 2026
125
AI Coding Agents Run a Marathon, and Fewer Than One in Three Finish
SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?
Desai, Hu, Cabezas et al. · Abundant·27 min·Jun 09, 2026
124
A Cheap Model With the Blueprints Beats Expensive Models Working Blind
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
Zhong, Segal, Bercovich et al. · Carnegie Mellon University·27 min·Jun 09, 2026
123
Five Identical Worlds, One Swapped Model: What Happens When AI Agents Run for Fifteen Days
Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy
Akkil, Kokku, Vikram et al. · Emergence AI·30 min·Jun 09, 2026
122
When Your Coding Agent Lies About the Fix: Verifying the Plan Before the Model Runs
Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory
Wang, Huang, Wang et al. · University of Illinois Urbana-Champaign·24 min·Jun 09, 2026
121
When the Agent Says It's Done But Nothing Happened: Debugging the Harness, Not the Model
From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws
Chen, Wang, Liu et al. · Institute of Software·27 min·Jun 05, 2026
113
What If a Prompt Injection Never Left? Attacks That Wait in Agent Memory
What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems
Xie, Liu, Zhang et al. · Institute of Information Engineering·27 min·Jun 04, 2026
112
When an AI Agent Cheats Without Being Told: Inside the Meta-Agent Challenge
The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?
Lu, Wang, Wang et al. · Institute of Software·22 min·Jun 04, 2026
111
How a 4B Web Agent Beat Models 60x Its Size on 500 Demonstrations
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
Yang, Wu, Chen et al. · UIUC·24 min·Jun 03, 2026
108
The Reasoning Cliff: Why Thinking Longer Makes Models Worse at Exact Step-by-Step Tasks
The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary
Guo, Wu, Yiu · The University of Hong Kong·32 min·Jun 03, 2026
104
How Making a Research Agent Smarter Quietly Makes It Leak Your Secrets
MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents
Gurung, Gella, Drouin et al. · University of Edinburgh·25 min·Jun 01, 2026
103
AI Agents Tried to Invent a Post-Human Language, And Reinvented Cherokee
Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion
Beltoft, Brach, Torrielli et al. · University of Southern Denmark·26 min·Jun 01, 2026
100
How a Prompt Wrapper Lets a Frontier Model Play Poker Like an Expert
PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers
Li, Wang, Huang · IIIS·29 min·May 29, 2026
097
Same Tokens, Same Cost, Wildly Different Results: What Actually Scales in AI Agents
Scaling Laws for Agent Harnesses via Effective Feedback Compute
Zhang, Wang, Xu et al. · Harbin Institute of Technology·25 min·May 29, 2026
094
Chain-of-Thought Monitoring Fails Across Languages, and Worst Where It's Needed Most
The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages
Onyame, Zhou, Thopalli et al. · University of Virginia·24 min·May 28, 2026
092
When Search Agents Don't Really Search: The Memory Shortcut Hiding in Browsing Benchmarks
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
Fan, Wang, Chu et al. · Harbin Institute of Technology·27 min·May 28, 2026
091
When Better Fine-Tuning Can't Help: A Geometric Impossibility in LLM Causal Reasoning
Why LLMs Fail at Causal Discovery and How Interventional Agents Escape
Roy, Parbhoo · SIRE·24 min·May 28, 2026
090
How MiniMax-M2 Bets That Sparsity Plus Verifiable Rewards Can Match Frontier Agents
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
MiniMax · MiniMax·28 min·May 27, 2026
089
When AI-Written Papers Read Well But the Evidence Underneath Is Broken
ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
Meng, Mishra, Chen et al. · Google Cloud AI Research·32 min·May 27, 2026
087
When No Agent Reads the Whole Document: A Universal Cliff in Multi-Agent Review
A Universal Cliff and a Design Fingerprint: Cross-Section Defect Detection Under LLM Orchestration
Fukui · Research Institute of Criminal Psychiatry·26 min·May 27, 2026
086
Why Frozen-Weight Agents Still Get Worse Over Time
Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems
Zhu, Ro, Robertson et al. · The University of Texas at Austin·23 min·May 27, 2026
085
Why Long-Context Models Might Need Compute, Not Capacity, Before Eviction
Language Models Need Sleep
Lee, McLeish, Goldstein et al. · Carnegie Mellon University·24 min·May 26, 2026
082
Training a Deep Research Agent on 8,000 Synthetic Tasks: The Rubric Tree Trick
QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks
Xie, Lin, Wang et al. · The Ohio State University·31 min·May 26, 2026
081
When Reasoning Models Decide Before They Think: Detecting and Fixing Premature Confidence
Understanding and Mitigating Premature Confidence for Better LLM Reasoning
Gai, Zeng, Baek et al. · Carnegie Mellon University·25 min·May 26, 2026
080
How a Two-Agent Trick Unlocked Large-Scale Training for Computer-Use Agents
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
Wang, Lu, Wang et al. · The University of Hong Kong·32 min·May 26, 2026
079
An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
Chen, Xu, Zhao et al. · Tongji University / Shanghai AI Laboratory / Nanyang Technological University·29 min·May 25, 2026
077
Reading a Model's Confidence Curve to Decide When Chain-of-Thought Is Worth It
When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions
Xia, Wang, Tang et al. · State Key Laboratory of General Artificial Intelligence·22 min·May 25, 2026
076
Same Model, Organized Differently: How an Agent Architecture Beat Frontier Systems at Research Math
RMA: an Agentic System for Research-Level Mathematical Problems
Zhao, Yuan, Choi et al. · Georgia Institute of Technology·22 min·May 25, 2026
070
When Models Know the Answer But Say the Wrong Thing Anyway
Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
Yeom, Sok, Kim et al. · Graduate School of Data Science·22 min·May 22, 2026
069
When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
Merrill, Lee, Karger · Forecasting Research Institute / UC Berkeley·30 min·May 22, 2026
067
An AI Just Solved a 1996 Erdős Problem—and the Simplest Agent Won
Advancing Mathematics Research with AI-Driven Formal Proof Search
Tsoukalas, Kovsharov, Shirobokov et al. · Google DeepMind·31 min·May 22, 2026
065
One Loop to Optimize Them All: A Universal API for LLM-Driven Discovery
optimize_anything: A Universal API for Optimizing any Text Parameter
Agrawal, Lee, Tan et al. · UC Berkeley·27 min·May 22, 2026
062
Treating Hallucinations as Exploits: A Gate-Based Architecture for Agent Safety
Hallucination as Exploit: Evidence-Carrying Multimodal Agents
Zhang, Zheng, Yang · Shenzhen University·24 min·May 20, 2026
061
When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents
Jha, Triedman, Bhattacharya et al. · Cornell University·27 min·May 20, 2026
059
Firefly's Inversion: Building Verified Tool-Call Training Data by Working Backward
Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
Lu, Wang, Lu et al. · Northeastern University·22 min·May 20, 2026
058
Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe
The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure
Liu, Holz, Ye et al. · University of Chinese Academy of Sciences·32 min·May 19, 2026
057
How Uber Caught 206 Leaked Credentials With an LLM-Powered Security Stack
ADR: An Agentic Detection System for Enterprise Agentic AI Security
Li, Hu, Xu et al. · Uber Technologies·28 min·May 19, 2026
055
Why LLM Judges Flip Their Verdicts When You Change the Question Format
Judge Circuits
Feldhus, Baeumel, Golimblevskaia et al. · Technische Universität Berlin / BIFOLD·26 min·May 19, 2026
052
An Old Reinforcement Learning Tradeoff Sneaks Back Into LLM Agents
Look Before You Leap: Autonomous Exploration for LLM Agents
Ye, Shi, Liu et al. · University of Science and Technology of China / Meituan·23 min·May 18, 2026
051
Why Parallel Sampling Plateaus, And What Evidence Graphs Do Instead
Argus: Evidence Assembly for Scalable Deep Research Agents
Zhang, Su, Chen et al. · MiroMind AI·22 min·May 18, 2026
048
How a 30B Open Model Reached Olympiad Gold With the Right Recipe
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
Li, Zhan, Zhang et al. · Shanghai AI Laboratory / The Chinese University of Hong Kong·31 min·May 16, 2026
047
When Agent Benchmarks Lie: The Harness Problem in Open-Source AI
Orchard: An Open-Source Agentic Modeling Framework
Peng, Yao, Wu et al. · Microsoft Research·28 min·May 15, 2026
046
When the AI Optimizer Edits the Grade Book: Why Harnessing Evolution Needs a Wall
Harnessing Agentic Evolution
Zhang, Gu, Ruan et al. · The Hong Kong University of Science and Technology (Guangzhou) / DeepWisdom·24 min·May 15, 2026
045
When a Frontier Model Talks Its Own Twin Into Climate Denial
LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs
Nogueira, Almeida, Bonás et al. · Maritaca AI·31 min·May 15, 2026
044
How One Sentence and a Forged History Flip the Most Aligned Models
History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
Salgado · Independent Researcher·23 min·May 15, 2026
039
When Smarter Agents Get Fooled by Three Extra Nodes in a Database
Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning
Kereopa-Yorke, Diaz, Wright et al. · Microsoft·31 min·May 12, 2026
037
Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
Elbadry, Heakl, Zhang et al. · Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)·27 min·May 12, 2026
035
Why Frontier Agents Ask for Clarification at Exactly the Wrong Moment
Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
Gulati, Gupta, Lumer et al. · PricewaterhouseCoopers U.S.·29 min·May 11, 2026
034
Catching Multi-Agent Deadlocks Before Deployment With a 40-Year-Old Tool
TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
Xia, Li, Ehsan et al. · Rutgers University·30 min·May 11, 2026
033
Echo: The Paper Arguing You Never Needed a KV Cache for Retrieval
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
Sridhar, Johansen · California·24 min·May 11, 2026
031
When Your AI Assistant Won't Let Go of Old Facts About You
STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
Chao, Bai, Sheng et al. · Wuhan University·24 min·May 09, 2026
029
Why Forty-Eight Percent on FrontierMath Isn't the Real Story in DeepMind's New Math Paper
AI Co-Mathematician: Accelerating Mathematicians with Agentic AI
Zheng, Glehn, Zwols et al. · Google DeepMind·20 min·May 08, 2026
021
Ten Thousand Examples Beat the Full Industrial Pipeline for Search Agents
OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
Du, Ye, Tang et al. · Shanghai Jiao Tong University·14 min·May 06, 2026
020
The Compliance Gap: Why AI Says Yes and Does No
The Compliance Gap: Why AI Systems Promise to Follow Process Instructions but Don't
Shin · Polymath Minds AI Lab·28 min·May 06, 2026
019
When the Best Reward Model Trains the Worst Policy: Inside EvoLM
EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
Li, Xin, Xiao et al. · University of Washington·26 min·May 06, 2026
018
Language Models Compute the Rational Move, Then Override It
What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control
Lekeas, Stamatopoulos · DreamWorks Animation·29 min·May 03, 2026
017
When the Agent Grades Its Own Homework: A Brutal New Benchmark for AI Workers
Gym-Anything: Turn any Software into an Agent Environment
Aggarwal, Neubig, Welleck · CMU·31 min·May 03, 2026
015
The Audit Number Isn't What You Think: Sycophancy and the Case Against Single-Prompt Bias Tests
Political Bias Audits of LLMs Capture Sycophancy to the Inferred Auditor
Törnberg, Schimmel · Institute of Logic·21 min·May 03, 2026
013
Why Search Keeps Rediscovering the Same Workflow, and What That Means
Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors
Du, Liu, Du et al. · Carnegie Mellon University·22 min·May 03, 2026
011
When RL Actually Teaches Agents Something New, And When It Doesn't
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
Zhai, Yan, Shao et al. · Fudan University·23 min·May 02, 2026
010
When Reward Climbs But Reasoning Goes Generic: Diagnosing Template Collapse in Agentic RL
RAGEN-2: Reasoning Collapse in Agentic RL
Wang, Gui, Jin et al. · Northwestern University·22 min·May 02, 2026
008
Why Long-Horizon AI Agents Get Stuck, and a Milestone-Based Fix That Helps
A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
Wang, Gooding, Hartmann et al. · Google DeepMind·24 min·May 02, 2026
007
Exploration Hacking: When Models Sabotage Their Own RL Training
Exploration Hacking: Can LLMs Learn to Resist RL Training?
Jang, Falck, Braun et al. · MATS·23 min·May 02, 2026
003
How to Pick the Best of Sixteen Coding Agent Rollouts
Scaling Test-Time Compute for Agentic Coding
Kim, Yang, Niu et al. · Meta Superintelligence Labs / University of Washington·17 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.

The Political Preferences of AI
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
LoCoMo: Long-Context Modular Memory for Dialogue State Tracking
Zoology: Measuring and Improving Recall in Efficient Language Models
TLA+: A Practical Introduction to Formal Methods for Distributed Systems
AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents
Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
AgentBench: Evaluating LLMs as Agents
Large Language Models are not Robust Multiple Choice Selectors
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
Are Emergent Abilities of Large Language Models a Mirage?
Inverse Scaling: When Bigger Isn't Better
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
WebArena: A Realistic Web Environment for Building Autonomous Agents
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
FRAMES: Factuality Evaluation with RAG, Multi-hop Reasoning, and Answer Summarization
Corr2Cause: A Benchmark to Assess LLMs' Ability to Infer Causal Relationships from Correlational Data
Superhuman AI for multiplayer poker
Agent-as-a-Judge: Evaluate Agents with Agents
LLaDA: Large Language Diffusion with mAsking
Who's Harry Potter? Approximate Unlearning in LLMs
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference