Theme · 18 episodes

Scalable Oversight

Definition

Scalable oversight is the research program of supervising AI systems whose outputs we can’t fully evaluate ourselves, either because the model is more capable than the human overseer or because the domain is too complex. Debate, recursive reward modeling, and constitutional AI are among the proposed approaches.

Episodes covering this

207
An AI Graded Its Own Math Test 94 Percent — It Actually Scored 20
More Convincing, Not More Correct: Self-Play Reward Hacking of Reference-Free LLM Judges
12 min·Jul 08, 2026
201
One in Four NeurIPS Papers Cites a Reference That Doesn't Exist
Phantom References: Hallucinated Citations That Survive Peer Review at Top-Tier Conferences
Russinovich, Kumar, Salem · Microsoft·19 min·Jul 06, 2026
199
Finding a Model's Hidden Behaviors Without Knowing What You're Looking For
Mechanistically Eliciting Latent Behaviors in Language Models
Mack, Panickssery, Turner · Principles of Intelligence·15 min·Jul 04, 2026
184
An AI Built an Undetectable Secret Channel, And Another AI Couldn't Find It
Tool Use Enables Undetectable Steganography in Multi-Agent LLM Systems
Rippin, Marshall, Africa et al. · Oxford University·19 min·Jun 30, 2026
178
How an AI Reviewer Learned to Stop Going Easy on AI Writing
The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators
Iacob, Jovanović, Shen et al. · University of Cambridge·23 min·Jun 26, 2026
158
How Floating-Point Rounding Lets a Model Tell Which Chip It's On — And Misbehave
FloatDoor: Platform-Triggered Backdoors in LLMs
Loose, Sander, Mächtle et al. · University of Luebeck·29 min·Jun 19, 2026
152
Training a Model to Mean What It Says, And Why That Isn't the Same as Being Good
Self-CTRL: Self-Consistency Training with Reinforcement Learning
Pres, Ruis, Ghebreselassie et al. · MIT CSAIL·26 min·Jun 18, 2026
140
When a Reasoning Model Says "Let Me Double-Check" After It's Already Decided
Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models
Scalena, Candussio, Bortolussi et al. · University of Groningen / University of Milano-Bicocca·27 min·Jun 12, 2026
124
A Cheap Model With the Blueprints Beats Expensive Models Working Blind
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
Zhong, Segal, Bercovich et al. · Carnegie Mellon University·27 min·Jun 09, 2026
109
An AI Got Caught Reading the Answer Key, And Why That Catch Matters
EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning
Chen, Shi, Li et al. · Shenzhen Institutes of Advanced Technology·28 min·Jun 03, 2026
103
AI Agents Tried to Invent a Post-Human Language, And Reinvented Cherokee
Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion
Beltoft, Brach, Torrielli et al. · University of Southern Denmark·26 min·Jun 01, 2026
101
Treating Math Formalization Like a Codebase, and Where the Agents Cheat
Formalizing Mathematics at Scale
Rammal, Patel, Gloeckle et al. · FAIR at Meta / CERMICS·27 min·May 29, 2026
094
Chain-of-Thought Monitoring Fails Across Languages, and Worst Where It's Needed Most
The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages
Onyame, Zhou, Thopalli et al. · University of Virginia·24 min·May 28, 2026
093
A Calibrated Knob for Weak-to-Strong AI Oversight, Tested on Real Code
Calibrating Conservatism for Scalable Oversight
Overman, Bayati · Stanford Graduate School of Business·22 min·May 28, 2026
087
When No Agent Reads the Whole Document: A Universal Cliff in Multi-Agent Review
A Universal Cliff and a Design Fingerprint: Cross-Section Defect Detection Under LLM Orchestration
Fukui · Research Institute of Criminal Psychiatry·26 min·May 27, 2026
054
When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window
Training on Documents About Monitoring Leads to CoT Obfuscation
Haskins, Chughtai, Engels · University of Canterbury·26 min·May 18, 2026
049
An AI Agent Reached for Root in Twelve Minutes, Without Being Attacked
Ambient Persuasion in a Deployed AI Agent: Unauthorized Escalation Following Routine Non-Adversarial Content Exposure
Cuadros, Maiga · Digital Epidemiology Laboratory·28 min·May 17, 2026
001
When AI Models Quietly Protect Each Other From Shutdown
Peer-Preservation in Frontier Models
Potter, Crispino, Siu et al. · University of California·25 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.