Definition
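Long-horizon tasks are tasks whose solution requires many sequential decisions, often with delayed feedback: planning a research project, refactoring a large codebase, navigating a multi-day workflow. They are especially hard for current agents because errors compound: a small mistake early on carries into every later step, and with delayed feedback it may go unnoticed until much of the work has been built on top of it.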
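A toy calculation makes the compounding concrete. This is a simplified sketch, not a model taken from any of the papers listed on this page: it assumes every step succeeds independently with the same probability and that a single failed step fails the whole task. The function name and the 99% per-step figure are illustrative choices.

```python
def task_success_probability(per_step_success: float, num_steps: int) -> float:
    """Chance that every step of a task succeeds.

    Assumption (illustrative only): steps are independent, share the same
    success probability, and one failed step fails the whole task.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    # Independent events multiply, so n steps give p ** n.
    return per_step_success ** num_steps


if __name__ == "__main__":
    # Hypothetical agent that gets each individual step right 99% of the time.
    for steps in (10, 100, 1000):
        chance = task_success_probability(0.99, steps)
        print(f"{steps:>5} steps: {chance:.4%} chance of finishing cleanly")
```

Under these assumptions, an agent that is right 99% of the time per step finishes a 10-step task about 90% of the time, but a 100-step task only about 37% of the time. Real agents can sometimes notice and recover from mistakes, so this shows the trend rather than predicting actual performance.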
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.
- Autonomous Chemical Research with Large Language Models
- Let's Verify Step by Step
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- FunSearch: Making new discoveries in mathematics using large language models
- OpenEvolve: Open-Source Implementation of AlphaEvolve