Concept · 33 episodes

Supervised Fine-Tuning

Definition

Supervised fine-tuning (SFT) trains a pretrained model on (input, target output) pairs to teach it a specific behavior or format. It is the simplest post-training method and, in most modern alignment pipelines, the first stage, run before any RL.
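To make the definition concrete, here is a minimal sketch of the objective SFT usually minimizes: the average negative log-likelihood of the target tokens given the input. The function names (`log_softmax`, `sft_loss`) and the `prompt_len` parameter are illustrative, not taken from any particular library. Scoring only the target tokens, and not the input tokens, is a common convention that is assumed here rather than stated above.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sft_loss(step_logits, token_ids, prompt_len):
    """Mean negative log-likelihood over target tokens only.

    token_ids   : input tokens followed by target tokens, as integer ids.
    step_logits : step_logits[t] holds the model's scores for the token at
                  position t + 1, given tokens 0..t.
    prompt_len  : number of leading tokens that belong to the input.
                  (Assumption: these positions contribute no loss.)
    """
    total, count = 0.0, 0
    for t in range(len(token_ids) - 1):
        next_pos = t + 1
        if next_pos < prompt_len:
            continue  # predicting an input token: skipped
        log_probs = log_softmax(step_logits[t])
        total -= log_probs[token_ids[next_pos]]
        count += 1
    if count == 0:
        raise ValueError("no target tokens to score")
    return total / count


# Toy example: vocabulary of 3 tokens, a 2-token input, a 1-token target.
tokens = [0, 1, 2]
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
confident = [[0.0, 0.0, 0.0], [0.0, 0.0, 5.0]]  # favors the correct target

loss_uniform = sft_loss(uniform, tokens, prompt_len=2)
loss_confident = sft_loss(confident, tokens, prompt_len=2)
```

In a real training loop the logits come from the pretrained model and this loss is backpropagated through its weights. The toy lists above stand in for model outputs only to show that the loss falls as the model puts more probability on the demonstrated target.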

Episodes covering this

192
A 32B Open Model Matched Frontier Systems By Learning to Take Notes
AutoMem: Automated Learning of Memory as a Cognitive Skill
Wu, Zhu, Zhang et al. · Stanford University·22 min·Jul 02, 2026
189
Why Phone Agents Ace the Test and Crash on Your Actual Phone
Xiaomi-GUI-0 Technical Report
Team, Qu, Luan · Xiaomi·24 min·Jul 02, 2026
187
An 8-Billion Agent That Beats Models 80 Times Its Size By Looking Things Up
An AI agent for treatment reasoning over a biomedical tool universe
Gao, Noori, Zhu et al. · Department of Biomedical Informatics·19 min·Jun 30, 2026
183
Why You Can't Fine-Tune Foresight Into an AI Agent
Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning
Zhang, Zhou, Qiao et al. · Fudan University / Shanghai Innovation Institute / Tencent Youtu Lab·23 min·Jun 29, 2026
167
How Teaching an AI to Predict, Not Act, Made It a Better Actor
Qwen-AgentWorld: Language World Models for General Agents
Team, Zuo, Xiao et al. · 27 min·Jun 24, 2026
166
A Router That Beats the Frontier Models It Calls
Sakana Fugu Technical Report
Tang, Cetin, Xu et al. · Sakana AI·26 min·Jun 23, 2026
163
Why Training Only on Perfect Solutions Cripples a Model's Reasoning
Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently
Wei, Kim · Princeton University·22 min·Jun 23, 2026
159
Can a Coding Agent Run Its Own Robot Experiments Overnight, With No Human Resetting the Scene?
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World
Xiao, Xie, Zhang et al. · NVIDIA·23 min·Jun 19, 2026
156
Why More Human Demonstrations Made a Computer-Use Agent Worse
ProCUA-SFT Technical Report
Jung, Lu, Cui et al. · NVIDIA / University of Washington·20 min·Jun 18, 2026
155
Why a Flawless Demo Makes a Worse Computer-Using Agent, And the Fix
Skill-Guided Continuation Distillation for GUI Agents
Fan, Yu, Shen et al. · StepFun·22 min·Jun 18, 2026
154
How a 7B Model Out-Investigates a 72B One by Choosing What to Look At
Native Active Perception as Reasoning for Omni-Modal Understanding
Xing, Xu, Wang et al. · The Chinese University of Hong Kong·21 min·Jun 18, 2026
148
Why Letting an AI Watch Its Own Scoreboard Can Quietly Overwrite Its Safety
Greed Is Learned: Visible Incentives as Reward-Hacking Triggers
Che, Wu · NVIDIA Research·26 min·Jun 16, 2026
141
How Two Tokens Reopened a Reasoning Method the Field Had Given Up On
Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning
Yang, Chen, Wu et al. · HKUST(GZ)·29 min·Jun 12, 2026
115
Teaching a Phone Agent to Reason Silently, And Keeping It Honest
MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
Yang, Hu, Hao et al. · Beihang University·24 min·Jun 04, 2026
111
How a 4B Web Agent Beat Models 60x Its Size on 500 Demonstrations
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
Yang, Wu, Chen et al. · UIUC·24 min·Jun 03, 2026
108
The Reasoning Cliff: Why Thinking Longer Makes Models Worse at Exact Step-by-Step Tasks
The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary
Guo, Wu, Yiu · The University of Hong Kong·32 min·Jun 03, 2026
099
How an Open-Book Trick Teaches a Model to Catch Its Own Mistakes
Self-Trained Verification for Training- and Test-Time Self-Improvement
Wu, Raghunathan · Carnegie Mellon University·21 min·May 29, 2026
091
When Better Fine-Tuning Can't Help: A Geometric Impossibility in LLM Causal Reasoning
Why LLMs Fail at Causal Discovery and How Interventional Agents Escape
Roy, Parbhoo · SIRE·24 min·May 28, 2026
084
Terminal Agents Get Free Supervision From The Tokens We've Been Throwing Away
ECHO: Terminal Agents Learn World Models for Free
Shrivastava, Kauffmann, Awadallah et al. · Microsoft Research·26 min·May 26, 2026
082
Training a Deep Research Agent on 8,000 Synthetic Tasks: The Rubric Tree Trick
QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks
Xie, Lin, Wang et al. · The Ohio State University·31 min·May 26, 2026
080
How a Two-Agent Trick Unlocked Large-Scale Training for Computer-Use Agents
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
Wang, Lu, Wang et al. · The University of Hong Kong·32 min·May 26, 2026
074
How a Fifteen-Hundred-Dollar Training Run Matched Llama and Gemma on Reasoning
HRM-Text: Efficient Pretraining Beyond Scaling
Wang, Liu, Wang et al. · Sapient Intelligence·21 min·May 24, 2026
071
When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
Xu, Wen, Li · Peking University·23 min·May 22, 2026
070
When Models Know the Answer But Say the Wrong Thing Anyway
Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
Yeom, Sok, Kim et al. · Graduate School of Data Science·22 min·May 22, 2026
066
Why Giving an AI Agent More Tools Can Make It Worse at Using a Computer
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
Hu, Zhang, Xu et al. · Tongyi Lab·26 min·May 22, 2026
048
How a 30B Open Model Reached Olympiad Gold With the Right Recipe
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
Li, Zhan, Zhang et al. · Shanghai AI Laboratory / The Chinese University of Hong Kong·31 min·May 16, 2026
047
When Agent Benchmarks Lie: The Harness Problem in Open-Source AI
Orchard: An Open-Source Agentic Modeling Framework
Peng, Yao, Wu et al. · Microsoft Research·28 min·May 15, 2026
043
When 'This Is False' Doesn't Stick: Why Models Learn the Lie Anyway
Negation Neglect: When models fail to learn negations in training
Mayne, McKinney, Dubiński et al. · University of Oxford·18 min·May 14, 2026
032
A Sticky-Note for Every Layer: Letting Transformers Remember What They Were Just Thinking
State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
Aviss · Fifth Dimension·23 min·May 09, 2026
021
Ten Thousand Examples Beat the Full Industrial Pipeline for Search Agents
OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
Du, Ye, Tang et al. · Shanghai Jiao Tong University·14 min·May 06, 2026
019
When the Best Reward Model Trains the Worst Policy: Inside EvoLM
EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
Li, Xin, Xiao et al. · University of Washington·26 min·May 06, 2026
011
When RL Actually Teaches Agents Something New, And When It Doesn't
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
Zhai, Yan, Shao et al. · Fudan University·23 min·May 02, 2026
009
How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers
SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
Limozin, Durech, Hoefler et al. · ETH AI Center·23 min·May 02, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.