Concept · 47 episodes

Ablation Studies

Definition

Ablation studies are experiments that selectively remove, disable, or replace one component of a system to measure how much that piece contributes to overall performance. They are the workhorse method for arguing causality in ML papers: if accuracy collapses when you delete a module, that module was doing real work.
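The simplest form is a leave-one-out ablation: score the full system once, then re-score it once per component with only that component removed, and read each component's contribution off the drop. The sketch below assumes that scheme. The component names (`retriever`, `reranker`, `memory`), the 50-point baseline, and `toy_evaluate` are hypothetical stand-ins for a real model run on a held-out benchmark.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

# An evaluator takes the set of enabled components and returns a score.
Evaluate = Callable[[FrozenSet[str]], float]


def leave_one_out_ablation(
    components: Iterable[str], evaluate: Evaluate
) -> Tuple[float, Dict[str, float]]:
    """Score the full system, then re-score it once per component with that
    component removed. Returns the full score and, per component, the drop
    caused by removing it (positive means the component was helping)."""
    full_set = frozenset(components)
    full_score = evaluate(full_set)
    drops: Dict[str, float] = {}
    for name in sorted(full_set):
        drops[name] = full_score - evaluate(full_set - {name})
    return full_score, drops


# Hypothetical toy system: each enabled component adds a fixed number of
# accuracy points on top of a 50-point baseline.
CONTRIBUTIONS = {"retriever": 20, "reranker": 5, "memory": 0}


def toy_evaluate(enabled: FrozenSet[str]) -> float:
    return 50 + sum(CONTRIBUTIONS[c] for c in enabled)


if __name__ == "__main__":
    full, drops = leave_one_out_ablation(CONTRIBUTIONS, toy_evaluate)
    print(f"full system: {full}")
    for name, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
        print(f"without {name:<9} -> {full - drop} (drop {drop})")
```

Because the toy contributions are purely additive, the single-component drops sum exactly to the gain over the baseline. With interacting components (say, a reranker that is useless without a retriever), removing one piece at a time can under- or over-credit individual parts, which is why studies often also ablate combinations.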

Episodes covering this

210
Same Website Request, Different Code — The Bias You Can't See
Biased or Personalized? The Impact of Personal Information on AI-driven Development
14 min·Jul 09, 2026
207
An AI Graded Its Own Math Test 94 Percent — It Actually Scored 20
More Convincing, Not More Correct: Self-Play Reward Hacking of Reference-Free LLM Judges
12 min·Jul 08, 2026
206
How Four-Second Clips Become Hours of Playable AI Soccer
Multiplayer Interactive World Models with Representation Autoencoders
15 min·Jul 07, 2026
204
The Length Estimate Hiding Inside a Word-by-Word Model
How Much is Left? LLMs Linearly Encode Their Remaining Output Length
14 min·Jul 07, 2026
200
The One Mechanism That Turns Twenty AI Clones Into an Actual Team
EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales
Zhang, Xu, Dai et al. · Oregon State University; AG2AI·19 min·Jul 04, 2026
198
The Model That Knows the Answer and Can't Say It
Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale
Gollapudi, Gupta, Singhal et al. · UC Berkeley·17 min·Jul 03, 2026
197
Twin Problems Suggest AI Reasoning Gains Are Mostly Better Fact Recall
IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs
Abdaljalil, Serpedin, Kurban · Texas A&M University·17 min·Jul 03, 2026
193
Freeze Most of the Network: Where RL Improvement Actually Lives in a Transformer
Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training
Zhang, Hu, Glentis et al. · University of Minnesota·22 min·Jul 02, 2026
182
How a Tiny Model Too Weak to Plan Cuts a Bigger Agent's Hallucinations by 80%
Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents
Song, Cai · Emory University·17 min·Jun 29, 2026
178
How an AI Reviewer Learned to Stop Going Easy on AI Writing
The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators
Iacob, Jovanović, Shen et al. · University of Cambridge·23 min·Jun 26, 2026
169
Why Better Bug Reports Can Make AI Coding Agents Worse
SHERLOC: Structured Diagnostic Localization for Code Repair Agents
Tamoyan, Narenthiran, Arakelyan et al. · NVIDIA / TU Darmstadt·24 min·Jun 24, 2026
168
When Turning Experience Into Code Makes Your AI Agent Dumber
Metis: Bridging Text and Code Memory for Self-Evolving Agents
Dai, He, Li et al. · The Chinese University of Hong Kong·27 min·Jun 24, 2026
154
How a 7B Model Out-Investigates a 72B One by Choosing What to Look At
Native Active Perception as Reasoning for Omni-Modal Understanding
Xing, Xu, Wang et al. · The Chinese University of Hong Kong·21 min·Jun 18, 2026
151
Why More Experience Made This AI Agent Worse, And How to Fix It
Not All Skills Help: Measuring and Repairing Agent Knowledge
Wang, Zhou, Liang et al. · UNC Chapel Hill·28 min·Jun 16, 2026
144
When an AI Agent Just Copies Its Tool — And Bigger Models Copy More
When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More
Wang, Vemuri · raptorX.ai·15 min·Jun 15, 2026
139
When Optimizing One GPU Kernel Quietly Breaks the Whole System
Arbor: Tree Search as a Cognition Layer for Autonomous Agents
Prakriya, Hou, Gong et al. · AMD·30 min·Jun 12, 2026
131
Why Autonomous Research Agents Forget Their Own Lessons, and Arbor's Fix
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
Jin, Hu, Qiu et al. · Renmin University of China·33 min·Jun 11, 2026
127
What Diffusion Language Models Were Missing: A Map, Not an Algorithm
TextLDM: Language Modeling with Continuous Latent Diffusion
Jiang, Ren, Li et al. · JoyFuture Academy / HIT·30 min·Jun 11, 2026
121
When the Agent Says It's Done But Nothing Happened: Debugging the Harness, Not the Model
From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws
Chen, Wang, Liu et al. · Institute of Software·27 min·Jun 05, 2026
120
How an AI Agent Rewrites Its Own Tools, Without an Answer Key
Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
Pan, Liu, Lin et al. · City University of Hong Kong·30 min·Jun 05, 2026
115
Teaching a Phone Agent to Reason Silently, And Keeping It Honest
MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
Yang, Hu, Hao et al. · Beihang University·24 min·Jun 04, 2026
111
How a 4B Web Agent Beat Models 60x Its Size on 500 Demonstrations
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
Yang, Wu, Chen et al. · UIUC·24 min·Jun 03, 2026
107
How a Market of Crippled AI Agents Outscored One Unrestricted Model
Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions
Qi, Su, Qu et al. · Harvard·26 min·Jun 03, 2026
106
Giving Agents a Notebook Instead of New Weights: How ExpGraph Lets Frozen Models Learn
ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents
Feng, Ye, Luo et al. · University of Illinois Urbana-Champaign·26 min·Jun 02, 2026
105
The Trojan Is Your Agent's Memory: Why Single-Step Defenses Miss Persistent Attacks
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors
Tan, Dou, Yang et al. · Gaoling School of Artificial Intelligence·26 min·Jun 01, 2026
101
Treating Math Formalization Like a Codebase, and Where the Agents Cheat
Formalizing Mathematics at Scale
Rammal, Patel, Gloeckle et al. · FAIR at Meta / CERMICS·27 min·May 29, 2026
100
How a Prompt Wrapper Lets a Frontier Model Play Poker Like an Expert
PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers
Li, Wang, Huang · IIIS·29 min·May 29, 2026
097
Same Tokens, Same Cost, Wildly Different Results: What Actually Scales in AI Agents
Scaling Laws for Agent Harnesses via Effective Feedback Compute
Zhang, Wang, Xu et al. · Harbin Institute of Technology·25 min·May 29, 2026
095
Seven Wins to Zero: How Organizing AI Agents Like a Lab Changes the Search
AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
Gao, Fang, Zitnik · Harvard University·24 min·May 28, 2026
083
Training the Translator: How a Small Communication Model Lets Agent Teams Outperform Themselves
AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning
Hu, Qian, Wang et al. · GSAI·24 min·May 26, 2026
079
An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
Chen, Xu, Zhao et al. · Tongji University / Shanghai AI Laboratory / Nanyang Technological University·29 min·May 25, 2026
078
Training a Markdown File: When LLM Self-Improvement Borrows the Discipline of Neural Net Training
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Yang, Gong, Huang et al. · Microsoft·28 min·May 25, 2026
077
Reading a Model's Confidence Curve to Decide When Chain-of-Thought Is Worth It
When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions
Xia, Wang, Tang et al. · State Key Laboratory of General Artificial Intelligence·22 min·May 25, 2026
076
Same Model, Organized Differently: How an Agent Architecture Beat Frontier Systems at Research Math
RMA: an Agentic System for Research-Level Mathematical Problems
Zhao, Yuan, Choi et al. · Georgia Institute of Technology·22 min·May 25, 2026
075
Growing Code and Proof Together: Verified Systems in Ten Hours Instead of a Year
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems
Agarwal, Krentsel, Liu et al. · UC Berkeley·28 min·May 25, 2026
074
How a Fifteen-Hundred-Dollar Training Run Matched Llama and Gemma on Reasoning
HRM-Text: Efficient Pretraining Beyond Scaling
Wang, Liu, Wang et al. · Sapient Intelligence·21 min·May 24, 2026
071
When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
Xu, Wen, Li · Peking University·23 min·May 22, 2026
070
When Models Know the Answer But Say the Wrong Thing Anyway
Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
Yeom, Sok, Kim et al. · Graduate School of Data Science·22 min·May 22, 2026
069
When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
Merrill, Lee, Karger · Forecasting Research Institute / UC Berkeley·30 min·May 22, 2026
064
When Agent Memory Stops Being a Database and Starts Being a Skill
Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents
Ye, Liu, Wang et al. · University of Illinois Urbana-Champaign·30 min·May 22, 2026
053
An AI Agent Swapped In Focal Loss And Beat A Human-Tuned Training Script
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design
Pepe, Lin, Magka et al. · FAIR at Meta·32 min·May 18, 2026
041
When the Iteration Teaches the Model to Skip the Iteration
Solve the Loop: Attractor Models for Language and Reasoning
Fein-Ashley, Rashidinejad · University of Southern California·30 min·May 13, 2026
040
Two Frozen Models Learn to Whisper: Coupling Through Hidden States
The Bicameral Model: Bidirectional Hidden-State Coupling Between Parallel Language Models
Flamant, Ghai, Shimizu · AWS Agentic AI·29 min·May 13, 2026
033
Echo: The Paper Arguing You Never Needed a KV Cache for Retrieval
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
Sridhar, Johansen · California·24 min·May 11, 2026
021
Ten Thousand Examples Beat the Full Industrial Pipeline for Search Agents
OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
Du, Ye, Tang et al. · Shanghai Jiao Tong University·14 min·May 06, 2026
013
Why Search Keeps Rediscovering the Same Workflow, and What That Means
Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors
Du, Liu, Du et al. · Carnegie Mellon University·22 min·May 03, 2026
012
Why AI Coding Agents Keep Trying to Debug Without a Debugger
Dynamic analysis enhances issue resolution
Liu, Wang, Chen et al. · Sun Yat-sen University·21 min·May 02, 2026