Theme · 100 episodes

Agentic AI

Definition

Agentic AI refers to AI systems that take goal-directed actions over multiple steps in an environment (calling tools, browsing the web, editing files) rather than producing a single response to a single prompt. This shift introduces a new class of risks around autonomy, long horizons, and irreversible actions.

Episodes covering this

202
How Do You Know an AI Agent Actually Refused? Check the World, Not the Words
Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
Feng, Lin, Wen et al. · AntGroup / Hunan Institute of Advanced Technology·18 min·Jul 06, 2026
200
The One Mechanism That Turns Twenty AI Clones Into an Actual Team
EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales
Zhang, Xu, Dai et al. · Oregon State University; AG2AI·19 min·Jul 04, 2026
196
AI Agents Reached Opposite Conclusions From the Same Data — and Passed Review
The Agentic Garden of Forking Paths
Miao, Pritchard, Zou · Stanford University·18 min·Jul 03, 2026
194
How a Robot Builds a Debugging Notebook It Can Read, Edit, and Hand to Another Robot
ASPIRE: Agentic /Skills Discovery for Robotics
Lu, Wu, Kou et al. · NVIDIA·24 min·Jul 02, 2026
192
A 32B Open Model Matched Frontier Systems By Learning to Take Notes
AutoMem: Automated Learning of Memory as a Cognitive Skill
Wu, Zhu, Zhang et al. · Stanford University·22 min·Jul 02, 2026
190
The Skill Every AI Manager Is Missing: Handing Out Exactly the Right Keys
ClawArena-Team: Benchmarking Subagent Orchestration and Dynamic Workflows in Language-Model Agents
Xiong, Ji, Qiu et al. · UNC Chapel Hill·21 min·Jul 02, 2026
189
Why Phone Agents Ace the Test and Crash on Your Actual Phone
Xiaomi-GUI-0 Technical Report
Team, Qu, Luan · Xiaomi·24 min·Jul 02, 2026
188
A Coding Agent Found a Hole in a Peer-Reviewed STOC Proof for Five Dollars
Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics
Moakhar, Gholami, Springer et al. · University of Maryland·20 min·Jul 02, 2026
187
An 8-Billion Agent That Beats Models 80 Times Its Size By Looking Things Up
An AI agent for treatment reasoning over a biomedical tool universe
Gao, Noori, Zhu et al. · Department of Biomedical Informatics·19 min·Jun 30, 2026
186
How a Frozen Model Went From 2% to 77% on Physics Puzzles — Without Retraining
Hierarchical Experimentalist Agents
Chandra, Vaidyanathan, Dhanuka et al. · University of Massachusetts Amherst·22 min·Jun 30, 2026
185
Aligned to Refuse, Built to Tap: When Phone Agents Know the Task Is a Crime and Do It Anyway
It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents
Sun, Chen, Zhou et al. · Fudan University·27 min·Jun 30, 2026
183
Why You Can't Fine-Tune Foresight Into an AI Agent
Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning
Zhang, Zhou, Qiao et al. · Fudan University / Shanghai Innovation Institute / Tencent Youtu Lab·23 min·Jun 29, 2026
181
How to Backpropagate Blame Through a Team of Chatbots — And When It Backfires
GBC: Gradient-Based Connections for Optimizing Multi-Agent Systems
Yang, Alrabah, Hakkani-Tür et al. · University of Illinois Urbana-Champaign·20 min·Jun 29, 2026
176
An AI Designed Its Own Psychology Studies, Then Confirmed What It Found
Closing the Loop to Discover Psychological Theories with an Automated Cognitive Scientist
Jagadish, Strittmatter, Jacoby et al. · Princeton University·31 min·Jun 26, 2026
175
One Crosscoder Feature Flips a Stalling Chatbot Into a Working Agent
Localizing RL-Induced Tool Use to a Single Crosscoder Feature
Shportko, Bhokare, AlZahrani et al. · Northwestern University·26 min·Jun 26, 2026
168
When Turning Experience Into Code Makes Your AI Agent Dumber
Metis: Bridging Text and Code Memory for Self-Evolving Agents
Dai, He, Li et al. · The Chinese University of Hong Kong·27 min·Jun 24, 2026
167
How Teaching an AI to Predict, Not Act, Made It a Better Actor
Qwen-AgentWorld: Language World Models for General Agents
Team, Zuo, Xiao et al. · 27 min·Jun 24, 2026
166
A Router That Beats the Frontier Models It Calls
Sakana Fugu Technical Report
Tang, Cetin, Xu et al. · Sakana AI·26 min·Jun 23, 2026
165
A Free-Lunch Tweak That Lets a Tiny Agent Beat Frontier Giants
Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning
Wang, Song, Zhang et al. · Peking University·22 min·Jun 23, 2026
164
The Summarizer That Quietly Deletes Your Agent's Safety Rules
Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents
Chen · Beijing Institute of Technology·28 min·Jun 23, 2026
161
A Robot That Plays Before You Give It a Job, And Why That Beats Retrying
Playful Agentic Robot Learning
Zhang, Ge, Yoo et al. · University of California·19 min·Jun 19, 2026
160
Training an AI to Take Its Own Notes, So Its Future Self Works Better
Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning
Chen, Shi, Xie et al. · Alibaba Group·23 min·Jun 19, 2026
159
Can a Coding Agent Run Its Own Robot Experiments Overnight, With No Human Resetting the Scene?
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World
Xiao, Xie, Zhang et al. · NVIDIA·23 min·Jun 19, 2026
157
When an AI Coding Agent Drives a Phone Through the Terminal, No Screen Needed
Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?
Gu, Jiang, Guo et al. · Mila–Québec AI Institute / Concordia University·24 min·Jun 19, 2026
156
Why More Human Demonstrations Made a Computer-Use Agent Worse
ProCUA-SFT Technical Report
Jung, Lu, Cui et al. · NVIDIA / University of Washington·20 min·Jun 18, 2026
155
Why a Flawless Demo Makes a Worse Computer-Using Agent, And the Fix
Skill-Guided Continuation Distillation for GUI Agents
Fan, Yu, Shen et al. · StepFun·22 min·Jun 18, 2026
154
How a 7B Model Out-Investigates a 72B One by Choosing What to Look At
Native Active Perception as Reasoning for Omni-Modal Understanding
Xing, Xu, Wang et al. · The Chinese University of Hong Kong·21 min·Jun 18, 2026
151
Why More Experience Made This AI Agent Worse, And How to Fix It
Not All Skills Help: Measuring and Repairing Agent Knowledge
Wang, Zhou, Liang et al. · UNC Chapel Hill·28 min·Jun 16, 2026
147
Agents Fail at the Body, Not the Brain: A Self-Rewriting Scaffold That Lifts a 9B Model 44 Points
HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry
Chen, Lu, Zhao et al. · 30 min·Jun 15, 2026
146
How an Innocent README Can Freeze an AI Agent's Safety Check for an Hour
From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails
Zhou, Wang, Ma et al. · Hong Kong University of Science and Technology·26 min·Jun 15, 2026
144
When an AI Agent Just Copies Its Tool — And Bigger Models Copy More
When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More
Wang, Vemuri · raptorX.ai·15 min·Jun 15, 2026
142
Training a Tiny Model to Run the Plumbing Between an Agent and the World
HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness
Wang, Wang, Taylor et al. · University of California·24 min·Jun 12, 2026
139
When Optimizing One GPU Kernel Quietly Breaks the Whole System
Arbor: Tree Search as a Cognition Layer for Autonomous Agents
Prakriya, Hou, Gong et al. · AMD·30 min·Jun 12, 2026
132
The Agent Failed — But Did the Instructions Deserve to Be Followed?
SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement
Gautam, Radhakrishna, Gulwani · Microsoft·30 min·Jun 11, 2026
131
Why Autonomous Research Agents Forget Their Own Lessons, and Arbor's Fix
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
Jin, Hu, Qiu et al. · Renmin University of China·33 min·Jun 11, 2026
125
AI Coding Agents Run a Marathon, and Fewer Than One in Three Finish
SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?
Desai, Hu, Cabezas et al. · Abundant·27 min·Jun 09, 2026
123
Five Identical Worlds, One Swapped Model: What Happens When AI Agents Run for Fifteen Days
Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy
Akkil, Kokku, Vikram et al. · Emergence AI·30 min·Jun 09, 2026
120
How an AI Agent Rewrites Its Own Tools, Without an Answer Key
Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
Pan, Liu, Lin et al. · City University of Hong Kong·30 min·Jun 05, 2026
119
Beating Reinforcement Learning Without Ever Touching the Model's Weights
Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents
Hwang, Suri, Villecroze et al. · Layer6 AI·22 min·Jun 05, 2026
114
Agents That Rewrite Their Own Weights Instead of Just Taking Notes
Scaling Self-Evolving Agents via Parametric Memory
Ren, Luo, Yang et al. · Peking University / Alibaba Group·26 min·Jun 04, 2026
113
What If a Prompt Injection Never Left? Attacks That Wait in Agent Memory
What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems
Xie, Liu, Zhang et al. · Institute of Information Engineering·27 min·Jun 04, 2026
112
When an AI Agent Cheats Without Being Told: Inside the Meta-Agent Challenge
The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?
Lu, Wang, Wang et al. · Institute of Software·22 min·Jun 04, 2026
111
How a 4B Web Agent Beat Models 60x Its Size on 500 Demonstrations
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
Yang, Wu, Chen et al. · UIUC·24 min·Jun 03, 2026
110
How an Agent Got 44 Points Better by Mining Its Own Scratch Paper
Inducing Reasoning Primitives from Agent Traces
Lei, Yan, Momo et al. · Carnegie Mellon University·27 min·Jun 03, 2026
109
An AI Got Caught Reading the Answer Key, And Why That Catch Matters
EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning
Chen, Shi, Li et al. · Shenzhen Institutes of Advanced Technology·28 min·Jun 03, 2026
105
The Trojan Is Your Agent's Memory: Why Single-Step Defenses Miss Persistent Attacks
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors
Tan, Dou, Yang et al. · Gaoling School of Artificial Intelligence·26 min·Jun 01, 2026
104
How Making a Research Agent Smarter Quietly Makes It Leak Your Secrets
MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents
Gurung, Gella, Drouin et al. · University of Edinburgh·25 min·Jun 01, 2026
102
How to Catch an AI Attack That No Single Conversation Reveals
Stateful Online Monitoring Catches Distributed Agent Attacks
Brown, Bhargav, Santhanam et al. · University of Pennsylvania·24 min·Jun 01, 2026
101
Treating Math Formalization Like a Codebase, and Where the Agents Cheat
Formalizing Mathematics at Scale
Rammal, Patel, Gloeckle et al. · FAIR at Meta / CERMICS·27 min·May 29, 2026
100
How a Prompt Wrapper Lets a Frontier Model Play Poker Like an Expert
PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers
Li, Wang, Huang · IIIS·29 min·May 29, 2026
097
Same Tokens, Same Cost, Wildly Different Results: What Actually Scales in AI Agents
Scaling Laws for Agent Harnesses via Effective Feedback Compute
Zhang, Wang, Xu et al. · Harbin Institute of Technology·25 min·May 29, 2026
096
How Treating an AI Agent's Execution Like Git Recovers a Coordination Penalty
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Yu, Chong, Nandi et al. · Northeastern University·22 min·May 28, 2026
091
When Better Fine-Tuning Can't Help: A Geometric Impossibility in LLM Causal Reasoning
Why LLMs Fail at Causal Discovery and How Interventional Agents Escape
Roy, Parbhoo · SIRE·24 min·May 28, 2026
090
How MiniMax-M2 Bets That Sparsity Plus Verifiable Rewards Can Match Frontier Agents
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
MiniMax · MiniMax·28 min·May 27, 2026
089
When AI-Written Papers Read Well But the Evidence Underneath Is Broken
ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
Meng, Mishra, Chen et al. · Google Cloud AI Research·32 min·May 27, 2026
088
Two Levers for Self-Improving AI: When Rewriting Code Isn't Enough
SIA: Self Improving AI with Harness & Weight Updates
Hebbar, Manawat, Verboomen et al. · Hexo Labs·25 min·May 27, 2026
083
Training the Translator: How a Small Communication Model Lets Agent Teams Outperform Themselves
AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning
Hu, Qian, Wang et al. · GSAI·24 min·May 26, 2026
082
Training a Deep Research Agent on 8,000 Synthetic Tasks: The Rubric Tree Trick
QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks
Xie, Lin, Wang et al. · The Ohio State University·31 min·May 26, 2026
080
How a Two-Agent Trick Unlocked Large-Scale Training for Computer-Use Agents
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
Wang, Lu, Wang et al. · The University of Hong Kong·32 min·May 26, 2026
078
Training a Markdown File: When LLM Self-Improvement Borrows the Discipline of Neural Net Training
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Yang, Gong, Huang et al. · Microsoft·28 min·May 25, 2026
076
Same Model, Organized Differently: How an Agent Architecture Beat Frontier Systems at Research Math
RMA: an Agentic System for Research-Level Mathematical Problems
Zhao, Yuan, Choi et al. · Georgia Institute of Technology·22 min·May 25, 2026
072
A Robot Made Graphene Without Help, And Caught Itself Hallucinating
Qumus: Realization of An Embodied AI Quantum Material Experimentalist
Shi, Zheng, Juan et al. · Princeton University·29 min·May 23, 2026
071
When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
Xu, Wen, Li · Peking University·23 min·May 22, 2026
068
The OS Trick That Makes Tree Search Practical for Coding Agents
DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback
Dong, He, Hou et al. · Institute of Parallel and Distributed Systems·27 min·May 22, 2026
067
An AI Just Solved a 1996 Erdős Problem—and the Simplest Agent Won
Advancing Mathematics Research with AI-Driven Formal Proof Search
Tsoukalas, Kovsharov, Shirobokov et al. · Google DeepMind·31 min·May 22, 2026
066
Why Giving an AI Agent More Tools Can Make It Worse at Using a Computer
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
Hu, Zhang, Xu et al. · Tongyi Lab·26 min·May 22, 2026
065
One Loop to Optimize Them All: A Universal API for LLM-Driven Discovery
optimize_anything: A Universal API for Optimizing any Text Parameter
Agrawal, Lee, Tan et al. · UC Berkeley·27 min·May 22, 2026
064
When Agent Memory Stops Being a Database and Starts Being a Skill
Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents
Ye, Liu, Wang et al. · University of Illinois Urbana-Champaign·30 min·May 22, 2026
062
Treating Hallucinations as Exploits: A Gate-Based Architecture for Agent Safety
Hallucination as Exploit: Evidence-Carrying Multimodal Agents
Zhang, Zheng, Yang · Shenzhen University·24 min·May 20, 2026
061
When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents
Jha, Triedman, Bhattacharya et al. · Cornell University·27 min·May 20, 2026
059
Firefly's Inversion: Building Verified Tool-Call Training Data by Working Backward
Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
Lu, Wang, Lu et al. · Northeastern University·22 min·May 20, 2026
058
Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe
The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure
Liu, Holz, Ye et al. · University of Chinese Academy of Sciences·32 min·May 19, 2026
057
How Uber Caught 206 Leaked Credentials With an LLM-Powered Security Stack
ADR: An Agentic Detection System for Enterprise Agentic AI Security
Li, Hu, Xu et al. · Uber Technologies·28 min·May 19, 2026
053
An AI Agent Swapped In Focal Loss And Beat A Human-Tuned Training Script
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design
Pepe, Lin, Magka et al. · FAIR at Meta·32 min·May 18, 2026
052
An Old Reinforcement Learning Tradeoff Sneaks Back Into LLM Agents
Look Before You Leap: Autonomous Exploration for LLM Agents
Ye, Shi, Liu et al. · University of Science and Technology of China / Meituan·23 min·May 18, 2026
051
Why Parallel Sampling Plateaus, And What Evidence Graphs Do Instead
Argus: Evidence Assembly for Scalable Deep Research Agents
Zhang, Su, Chen et al. · MiroMind AI·22 min·May 18, 2026
049
An AI Agent Reached for Root in Twelve Minutes, Without Being Attacked
Ambient Persuasion in a Deployed AI Agent: Unauthorized Escalation Following Routine Non-Adversarial Content Exposure
Cuadros, Maiga · Digital Epidemiology Laboratory·28 min·May 17, 2026
047
When Agent Benchmarks Lie: The Harness Problem in Open-Source AI
Orchard: An Open-Source Agentic Modeling Framework
Peng, Yao, Wu et al. · Microsoft Research·28 min·May 15, 2026
046
When the AI Optimizer Edits the Grade Book: Why Harnessing Evolution Needs a Wall
Harnessing Agentic Evolution
Zhang, Gu, Ruan et al. · The Hong Kong University of Science and Technology (Guangzhou) / DeepWisdom·24 min·May 15, 2026
044
How One Sentence and a Forged History Flip the Most Aligned Models
History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
Salgado · Independent Researcher·23 min·May 15, 2026
042
An Agentic Scientific Computing System That Actually Remembers What It Learns
GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms
Toscano, Chai, Karniadakis · Division of Applied Mathematics·30 min·May 13, 2026
039
When Smarter Agents Get Fooled by Three Extra Nodes in a Database
Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning
Kereopa-Yorke, Diaz, Wright et al. · Microsoft·31 min·May 12, 2026
035
Why Frontier Agents Ask for Clarification at Exactly the Wrong Moment
Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
Gulati, Gupta, Lumer et al. · PricewaterhouseCoopers U.S.·29 min·May 11, 2026
034
Catching Multi-Agent Deadlocks Before Deployment With a 40-Year-Old Tool
TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
Xia, Li, Ehsan et al. · Rutgers University·30 min·May 11, 2026
030
Why Your AI Agent Won't Stop Working — and Each Model Falls for a Different Trap
LoopTrap: Termination Poisoning Attacks on LLM Agents
Xu, Wang, Zhang et al. · Zhejiang University·30 min·May 09, 2026
029
Why Forty-Eight Percent on FrontierMath Isn't the Real Story in DeepMind's New Math Paper
AI Co-Mathematician: Accelerating Mathematicians with Agentic AI
Zheng, Glehn, Zwols et al. · Google DeepMind·20 min·May 08, 2026
027
When AI Agents Build the Serving Stack: A Bet on Bespoke Infrastructure
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
Kamahori, Li, Peter et al. · University of Washington·30 min·May 08, 2026
024
An AI Agent That Found 28 Zero-Days in Windows — And What Made It Work
Agentic Vulnerability Reasoning on Windows COM Binaries
Lee, Kim, Zhang · University of Illinois at Urbana-Champaign·22 min·May 07, 2026
023
Why a Small Agent Confidently Overwrites Memories It Doesn't Understand
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
Mao, Zhao, Penn et al. · City University of Hong Kong·23 min·May 07, 2026
022
Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap
Model Spec Midtraining: Improving How Alignment Training Generalizes
Li, Price, Marks et al. · Anthropic Fellows Program·32 min·May 06, 2026
017
When the Agent Grades Its Own Homework: A Brutal New Benchmark for AI Workers
Gym-Anything: Turn any Software into an Agent Environment
Aggarwal, Neubig, Welleck · CMU·31 min·May 03, 2026
016
Why Your Coding Agent Stalls While the GPU Runs Hot
MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems
Wang, Ye, Xu et al. · Duke University·24 min·May 03, 2026
014
Why a Constrained Pipeline Beat a Full Coding Agent at Finding Bugs 30-to-1
Guiding Symbolic Execution with Static Analysis and LLMs for Vulnerability Discovery
Shafiuzzaman, Desai, Guo et al. · University of California·32 min·May 03, 2026
013
Why Search Keeps Rediscovering the Same Workflow, and What That Means
Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors
Du, Liu, Du et al. · Carnegie Mellon University·22 min·May 03, 2026
012
Why AI Coding Agents Keep Trying to Debug Without a Debugger
Dynamic analysis enhances issue resolution
Liu, Wang, Chen et al. · Sun Yat-sen University·21 min·May 02, 2026
011
When RL Actually Teaches Agents Something New, And When It Doesn't
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
Zhai, Yan, Shao et al. · Fudan University·23 min·May 02, 2026
008
Why Long-Horizon AI Agents Get Stuck, and a Milestone-Based Fix That Helps
A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
Wang, Gooding, Hartmann et al. · Google DeepMind·24 min·May 02, 2026
005
Why a Debugger Designed for Humans Is the Wrong Tool for an AI Agent
Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis
Xiang, Xu, Chu et al. · Southern University of Science and Technology·22 min·May 01, 2026
003
How to Pick the Best of Sixteen Coding Agent Rollouts
Scaling Test-Time Compute for Agentic Coding
Kim, Yang, Niu et al. · Meta Superintelligence Labs / University of Washington·17 min·May 01, 2026
002
An AI Ran a Real Optics Lab for 21 Hours and Found a Transformer-Shaped Pattern in Light
End-to-end autonomous scientific discovery on a real optical platform
Yang, Chen, Zhao et al. · Zhejiang University·29 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.

AlphaProof and AlphaGeometry 2
AGENTBENCH: Evaluating LLMs as Agents
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
CaMeL: How to make LLM agents safe
Agent-as-a-Judge: Evaluate Agents with Agents
AIDE: AI-Driven Exploration in Machine Learning Research
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
Toolformer: Language Models Can Teach Themselves to Use Tools
Graph Neural Networks: A Review of Methods and Applications
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces