Concept · 37 episodes

Long-Horizon Tasks

Definition

Long-horizon tasks are tasks whose solution requires many sequential decisions, often with delayed feedback: planning a research project, refactoring a large codebase, navigating a multi-day workflow. Because each step builds on the ones before it, errors compound, so these tasks surface weaknesses in current agents that shorter tasks hide.
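To make the compounding claim concrete, here is a toy calculation. It is a simplified sketch, not a model drawn from any episode. It assumes every step succeeds independently with the same probability, and the function name `chain_success` is purely illustrative.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Chance that every one of `steps` decisions succeeds.

    Assumption (not from the source): steps are independent and share
    the same success probability, so the chance of a clean run is the
    per-step probability raised to the number of steps.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    # A 99%-reliable step looks strong in isolation but decays over a long run.
    for steps in (1, 10, 100):
        print(f"{steps:>3} steps: {chain_success(0.99, steps):.3f}")
```

Under these assumptions, an agent that is right 99% of the time per step completes a 10-step task about 90% of the time and a 100-step task only about 37% of the time. Real agents can sometimes notice and recover from mistakes, so treat this as intuition rather than a prediction.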

Episodes covering this

194
How a Robot Builds a Debugging Notebook It Can Read, Edit, and Hand to Another Robot
ASPIRE: Agentic /Skills Discovery for Robotics
Lu, Wu, Kou et al. · NVIDIA·24 min·Jul 02, 2026
192
A 32B Open Model Matched Frontier Systems By Learning to Take Notes
AutoMem: Automated Learning of Memory as a Cognitive Skill
Wu, Zhu, Zhang et al. · Stanford University·22 min·Jul 02, 2026
189
Why Phone Agents Ace the Test and Crash on Your Actual Phone
Xiaomi-GUI-0 Technical Report
Team, Qu, Luan · Xiaomi·24 min·Jul 02, 2026
183
Why You Can't Fine-Tune Foresight Into an AI Agent
Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning
Zhang, Zhou, Qiao et al. · Fudan University / Shanghai Innovation Institute / Tencent Youtu Lab·23 min·Jun 29, 2026
182
How a Tiny Model Too Weak to Plan Cuts a Bigger Agent's Hallucinations by 80%
Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents
Song, Cai · Emory University·17 min·Jun 29, 2026
173
The Free Step-Level Grader Hiding in Every RL Training Run
Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents
Oh, Li, Park et al. · University of Wisconsin–Madison·22 min·Jun 25, 2026
165
A Free-Lunch Tweak That Lets a Tiny Agent Beat Frontier Giants
Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning
Wang, Song, Zhang et al. · Peking University·22 min·Jun 23, 2026
160
Training an AI to Take Its Own Notes, So Its Future Self Works Better
Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning
Chen, Shi, Xie et al. · Alibaba Group·23 min·Jun 19, 2026
157
When an AI Coding Agent Drives a Phone Through the Terminal, No Screen Needed
Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?
Gu, Jiang, Guo et al. · Mila–Québec AI Institute / Concordia University·24 min·Jun 19, 2026
155
Why a Flawless Demo Makes a Worse Computer-Using Agent, And the Fix
Skill-Guided Continuation Distillation for GUI Agents
Fan, Yu, Shen et al. · StepFun·22 min·Jun 18, 2026
154
How a 7B Model Out-Investigates a 72B One by Choosing What to Look At
Native Active Perception as Reasoning for Omni-Modal Understanding
Xing, Xu, Wang et al. · The Chinese University of Hong Kong·21 min·Jun 18, 2026
150
Don't Kill the Loser: A Different Way to Handle Two AI Agents Colliding
CoAgent: Concurrency Control for Multi-Agent Systems
Lyu, Zhang, Wu et al. · Shanghai Jiao Tong University·32 min·Jun 16, 2026
142
Training a Tiny Model to Run the Plumbing Between an Agent and the World
HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness
Wang, Wang, Taylor et al. · University of California·24 min·Jun 12, 2026
139
When Optimizing One GPU Kernel Quietly Breaks the Whole System
Arbor: Tree Search as a Cognition Layer for Autonomous Agents
Prakriya, Hou, Gong et al. · AMD·30 min·Jun 12, 2026
131
Why Autonomous Research Agents Forget Their Own Lessons, and Arbor's Fix
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
Jin, Hu, Qiu et al. · Renmin University of China·33 min·Jun 11, 2026
125
AI Coding Agents Run a Marathon, and Fewer Than One in Three Finish
SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?
Desai, Hu, Cabezas et al. · Abundant·27 min·Jun 09, 2026
123
Five Identical Worlds, One Swapped Model: What Happens When AI Agents Run for Fifteen Days
Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy
Akkil, Kokku, Vikram et al. · Emergence AI·30 min·Jun 09, 2026
122
When Your Coding Agent Lies About the Fix: Verifying the Plan Before the Model Runs
Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory
Wang, Huang, Wang et al. · University of Illinois Urbana-Champaign·24 min·Jun 09, 2026
120
How an AI Agent Rewrites Its Own Tools, Without an Answer Key
Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
Pan, Liu, Lin et al. · City University of Hong Kong·30 min·Jun 05, 2026
115
Teaching a Phone Agent to Reason Silently, And Keeping It Honest
MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
Yang, Hu, Hao et al. · Beihang University·24 min·Jun 04, 2026
112
When an AI Agent Cheats Without Being Told: Inside the Meta-Agent Challenge
The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?
Lu, Wang, Wang et al. · Institute of Software·22 min·Jun 04, 2026
108
The Reasoning Cliff: Why Thinking Longer Makes Models Worse at Exact Step-by-Step Tasks
The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary
Guo, Wu, Yiu · The University of Hong Kong·32 min·Jun 03, 2026
105
The Trojan Is Your Agent's Memory: Why Single-Step Defenses Miss Persistent Attacks
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors
Tan, Dou, Yang et al. · Gaoling School of Artificial Intelligence·26 min·Jun 01, 2026
096
How Treating an AI Agent's Execution Like Git Recovers a Coordination Penalty
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Yu, Chong, Nandi et al. · Northeastern University·22 min·May 28, 2026
095
Seven Wins to Zero: How Organizing AI Agents Like a Lab Changes the Search
AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
Gao, Fang, Zitnik · Harvard University·24 min·May 28, 2026
092
When Search Agents Don't Really Search: The Memory Shortcut Hiding in Browsing Benchmarks
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
Fan, Wang, Chu et al. · Harbin Institute of Technology·27 min·May 28, 2026
090
How MiniMax-M2 Bets That Sparsity Plus Verifiable Rewards Can Match Frontier Agents
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
MiniMax · MiniMax·28 min·May 27, 2026
084
Terminal Agents Get Free Supervision From The Tokens We've Been Throwing Away
ECHO: Terminal Agents Learn World Models for Free
Shrivastava, Kauffmann, Awadallah et al. · Microsoft Research·26 min·May 26, 2026
083
Training the Translator: How a Small Communication Model Lets Agent Teams Outperform Themselves
AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning
Hu, Qian, Wang et al. · GSAI·24 min·May 26, 2026
076
Same Model, Organized Differently: How an Agent Architecture Beat Frontier Systems at Research Math
RMA: an Agentic System for Research-Level Mathematical Problems
Zhao, Yuan, Choi et al. · Georgia Institute of Technology·22 min·May 25, 2026
072
A Robot Made Graphene Without Help, And Caught Itself Hallucinating
Qumus: Realization of An Embodied AI Quantum Material Experimentalist
Shi, Zheng, Juan et al. · Princeton University·29 min·May 23, 2026
068
The OS Trick That Makes Tree Search Practical for Coding Agents
DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback
Dong, He, Hou et al. · Institute of Parallel and Distributed Systems·27 min·May 22, 2026
046
When the AI Optimizer Edits the Grade Book: Why Harnessing Evolution Needs a Wall
Harnessing Agentic Evolution
Zhang, Gu, Ruan et al. · The Hong Kong University of Science and Technology (Guangzhou) / DeepWisdom·24 min·May 15, 2026
017
When the Agent Grades Its Own Homework: A Brutal New Benchmark for AI Workers
Gym-Anything: Turn any Software into an Agent Environment
Aggarwal, Neubig, Welleck · CMU·31 min·May 03, 2026
008
Why Long-Horizon AI Agents Get Stuck, and a Milestone-Based Fix That Helps
A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
Wang, Gooding, Hartmann et al. · Google DeepMind·24 min·May 02, 2026
003
How to Pick the Best of Sixteen Coding Agent Rollouts
Scaling Test-Time Compute for Agentic Coding
Kim, Yang, Niu et al. · Meta Superintelligence Labs / University of Washington·17 min·May 01, 2026
002
An AI Ran a Real Optics Lab for 21 Hours and Found a Transformer-Shaped Pattern in Light
End-to-end autonomous scientific discovery on a real optical platform
Yang, Chen, Zhao et al. · Zhejiang University·29 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.