Theme · 33 episodes

Reinforcement Learning

Definition

Reinforcement learning is the framework in which an agent learns to act in an environment by maximizing cumulative reward, with no explicit supervision on individual actions. In the LLM era, it is how models are shaped after pretraining: from preferences, from rubrics, and from outcomes.
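
To make "maximizing cumulative reward with no supervision on individual actions" concrete, here is a minimal sketch using an epsilon-greedy agent on a two-armed bandit. It is a textbook toy, not a method from any episode listed here. The function name `run_bandit`, the payout probabilities, and the hyperparameters are all illustrative assumptions. The agent is never told which arm is correct; it only sees the reward that follows each pull.

```python
import random


def run_bandit(payout_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a Bernoulli bandit (hypothetical toy setup)."""
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    estimates = [0.0] * n_arms  # running mean reward observed per arm
    counts = [0] * n_arms       # number of pulls per arm
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest estimated reward.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # The environment returns only a scalar reward, never a "correct action".
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0

        # Incremental update of the sample mean for the pulled arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, total_reward


estimates, total_reward = run_bandit([0.2, 0.8])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated best arm:", best_arm)
print("cumulative reward:", total_reward)
```

The same loop of acting, observing a reward, and updating is what preference-based, rubric-based, and outcome-based post-training build on, with a language model as the policy and a far richer reward signal.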

Episodes covering this

170
When a One-Liner Beats Your Agent's Clever Verification Logic
Bayesian control for coding agents
Papamarkou, Smirnov, Mazanov et al. · PolyShape / National Technical University of Athens·26 min·Jun 24, 2026
167
How Teaching an AI to Predict, Not Act, Made It a Better Actor
Qwen-AgentWorld: Language World Models for General Agents
Team, Zuo, Xiao et al. · 27 min·Jun 24, 2026
159
Can a Coding Agent Run Its Own Robot Experiments Overnight, With No Human Resetting the Scene?
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World
Xiao, Xie, Zhang et al. · NVIDIA·23 min·Jun 19, 2026
154
How a 7B Model Out-Investigates a 72B One by Choosing What to Look At
Native Active Perception as Reasoning for Omni-Modal Understanding
Xing, Xu, Wang et al. · The Chinese University of Hong Kong·21 min·Jun 18, 2026
152
Training a Model to Mean What It Says, And Why That Isn't the Same as Being Good
Self-CTRL: Self-Consistency Training with Reinforcement Learning
Pres, Ruis, Ghebreselassie et al. · MIT CSAIL·26 min·Jun 18, 2026
150
Don't Kill the Loser: A Different Way to Handle Two AI Agents Colliding
CoAgent: Concurrency Control for Multi-Agent Systems
Lyu, Zhang, Wu et al. · Shanghai Jiao Tong University·32 min·Jun 16, 2026
148
Why Letting an AI Watch Its Own Scoreboard Can Quietly Overwrite Its Safety
Greed Is Learned: Visible Incentives as Reward-Hacking Triggers
Che, Wu · NVIDIA Research·26 min·Jun 16, 2026
147
Agents Fail at the Body, Not the Brain: A Self-Rewriting Scaffold That Lifts a 9B Model 44 Points
HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry
Chen, Lu, Zhao et al. · 30 min·Jun 15, 2026
141
How Two Tokens Reopened a Reasoning Method the Field Had Given Up On
Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning
Yang, Chen, Wu et al. · HKUST(GZ)·29 min·Jun 12, 2026
133
How MiniMax Turned a Reward-Hacking Disaster Into Olympiad Gold
MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
Chen, Zhang, Zhang et al. · MiniMax / The Chinese University of Hong Kong·34 min·Jun 12, 2026
126
How Coding Agents Can Mine Their Own Failures Into a Self-Targeting Curriculum
Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills
Xiao, Jiao, Wang et al. · Shanghai Jiao Tong University·21 min·Jun 09, 2026
124
A Cheap Model With the Blueprints Beats Expensive Models Working Blind
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
Zhong, Segal, Bercovich et al. · Carnegie Mellon University·27 min·Jun 09, 2026
119
Beating Reinforcement Learning Without Ever Touching the Model's Weights
Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents
Hwang, Suri, Villecroze et al. · Layer6 AI·22 min·Jun 05, 2026
118
Why the Best-Aligned AI Models Are the Easiest to Trick Into Producing Harm
Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack
Hoang, Le, Xu et al. · Singapore University of Technology and Design·23 min·Jun 05, 2026
114
Agents That Rewrite Their Own Weights Instead of Just Taking Notes
Scaling Self-Evolving Agents via Parametric Memory
Ren, Luo, Yang et al. · Peking University / Alibaba Group·26 min·Jun 04, 2026
107
How a Market of Crippled AI Agents Outscored One Unrestricted Model
Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions
Qi, Su, Qu et al. · Harvard·26 min·Jun 03, 2026
106
Giving Agents a Notebook Instead of New Weights: How ExpGraph Lets Frozen Models Learn
ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents
Feng, Ye, Luo et al. · University of Illinois Urbana-Champaign·26 min·Jun 02, 2026
104
How Making a Research Agent Smarter Quietly Makes It Leak Your Secrets
MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents
Gurung, Gella, Drouin et al. · University of Edinburgh·25 min·Jun 01, 2026
096
How Treating an AI Agent's Execution Like Git Recovers a Coordination Penalty
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Yu, Chong, Nandi et al. · Northeastern University·22 min·May 28, 2026
090
How MiniMax-M2 Bets That Sparsity Plus Verifiable Rewards Can Match Frontier Agents
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
MiniMax · MiniMax·28 min·May 27, 2026
075
Growing Code and Proof Together: Verified Systems in Ten Hours Instead of a Year
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems
Agarwal, Krentsel, Liu et al. · UC Berkeley·28 min·May 25, 2026
064
When Agent Memory Stops Being a Database and Starts Being a Skill
Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents
Ye, Liu, Wang et al. · University of Illinois Urbana-Champaign·30 min·May 22, 2026
060
When Splitting One Model Across Three Agents Doubles Its Accuracy
NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning
Lu, Fang, Zhong et al. · University of Georgia·26 min·May 20, 2026
053
An AI Agent Swapped In Focal Loss And Beat A Human-Tuned Training Script
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design
Pepe, Lin, Magka et al. · FAIR at Meta·32 min·May 18, 2026
042
An Agentic Scientific Computing System That Actually Remembers What It Learns
GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms
Toscano, Chai, Karniadakis · Division of Applied Mathematics·30 min·May 13, 2026
040
Two Frozen Models Learn to Whisper: Coupling Through Hidden States
The Bicameral Model: Bidirectional Hidden-State Coupling Between Parallel Language Models
Flamant, Ghai, Shimizu · AWS Agentic AI·29 min·May 13, 2026
034
Catching Multi-Agent Deadlocks Before Deployment With a 40-Year-Old Tool
TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
Xia, Li, Ehsan et al. · Rutgers University·30 min·May 11, 2026
028
Teaching a Model to Hire Copies of Itself: Recursive Agent Optimization
Recursive Agent Optimization
Gandhi, Chakraborty, Wang et al. · Carnegie Mellon University·23 min·May 08, 2026
021
Ten Thousand Examples Beat the Full Industrial Pipeline for Search Agents
OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
Du, Ye, Tang et al. · Shanghai Jiao Tong University·14 min·May 06, 2026
011
When RL Actually Teaches Agents Something New, And When It Doesn't
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
Zhai, Yan, Shao et al. · Fudan University·23 min·May 02, 2026
010
When Reward Climbs But Reasoning Goes Generic: Diagnosing Template Collapse in Agentic RL
RAGEN-2: Reasoning Collapse in Agentic RL
Wang, Gui, Jin et al. · Northwestern University·22 min·May 02, 2026
008
Why Long-Horizon AI Agents Get Stuck, and a Milestone-Based Fix That Helps
A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
Wang, Gooding, Hartmann et al. · Google DeepMind·24 min·May 02, 2026
007
Exploration Hacking: When Models Sabotage Their Own RL Training
Exploration Hacking: Can LLMs Learn to Resist RL Training?
Jang, Falck, Braun et al. · MATS·23 min·May 02, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.