Theme · 55 episodes

AI Alignment

Definition

AI alignment is the technical and conceptual problem of making AI systems pursue the goals their designers and users actually want, rather than misspecified proxies or emergent agendas of their own. It spans training methods, evaluations, and theory, and gets harder as systems get more capable.

Episodes covering this

209
How 2.6 Billion Doodles Exposed the Culture Words Quietly Delete
Billions of Sketches Reveal Hidden Cultural Variation in Human Concepts
15 min·Jul 09, 2026
207
An AI Graded Its Own Math Test 94 Percent — It Actually Scored 20
More Convincing, Not More Correct: Self-Play Reward Hacking of Reference-Free LLM Judges
12 min·Jul 08, 2026
204
The Length Estimate Hiding Inside a Word-by-Word Model
How Much is Left? LLMs Linearly Encode Their Remaining Output Length
14 min·Jul 07, 2026
203
The Thought a Model Doesn't Say — and the Lens That Reads It
Verbalizable Representations Form a Global Workspace in Language Models
Gurnee, Sofroniew, Pearce et al. · Anthropic·16 min·Jul 07, 2026
199
Finding a Model's Hidden Behaviors Without Knowing What You're Looking For
Mechanistically Eliciting Latent Behaviors in Language Models
Mack, Panickssery, Turner · Principles of Intelligence·15 min·Jul 04, 2026
183
Why You Can't Fine-Tune Foresight Into an AI Agent
Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning
Zhang, Zhou, Qiao et al. · Fudan University / Shanghai Innovation Institute / Tencent Youtu Lab·23 min·Jun 29, 2026
181
How to Backpropagate Blame Through a Team of Chatbots — And When It Backfires
GBC: Gradient-Based Connections for Optimizing Multi-Agent Systems
Yang, Alrabah, Hakkani-Tür et al. · University of Illinois Urbana-Champaign·20 min·Jun 29, 2026
178
How an AI Reviewer Learned to Stop Going Easy on AI Writing
The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators
Iacob, Jovanović, Shen et al. · University of Cambridge·23 min·Jun 26, 2026
174
When the AI 'Schemes,' It's Usually Just Lazy or Confused
Model Forensics: Investigating Whether Concerning Behavior Reflects Misalignment
Singh, Kroiz, Rajamanoharan et al. · MATS·28 min·Jun 25, 2026
172
One Bad Token Can Sink a Model's Math, And You Can Delete It
Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning
Ko, Kang, Lee · Seoul National University·22 min·Jun 25, 2026
171
The Safety Decision a Model Makes Before It Thinks a Word
Do Thinking Tokens Help with Safety?
Ri, Panigrahi, Arora · Princeton Language and Intelligence·25 min·Jun 25, 2026
167
How Teaching an AI to Predict, Not Act, Made It a Better Actor
Qwen-AgentWorld: Language World Models for General Agents
Team, Zuo, Xiao et al. ·27 min·Jun 24, 2026
163
Why Training Only on Perfect Solutions Cripples a Model's Reasoning
Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently
Wei, Kim · Princeton University·22 min·Jun 23, 2026
160
Training an AI to Take Its Own Notes, So Its Future Self Works Better
Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning
Chen, Shi, Xie et al. · Alibaba Group·23 min·Jun 19, 2026
153
Catching a Lie From the Inside, When the Words Look Completely Honest
Rift: A Conflict Signature for Deception in Language Models
Nyoma · Harmonic Labs·26 min·Jun 18, 2026
152
Training a Model to Mean What It Says, And Why That Isn't the Same as Being Good
Self-CTRL: Self-Consistency Training with Reinforcement Learning
Pres, Ruis, Ghebreselassie et al. · MIT CSAIL·26 min·Jun 18, 2026
151
Why More Experience Made This AI Agent Worse, And How to Fix It
Not All Skills Help: Measuring and Repairing Agent Knowledge
Wang, Zhou, Liang et al. · UNC Chapel Hill·28 min·Jun 16, 2026
149
When Cornering a Chatbot Makes It Lie: J.P. Morgan's Case for 'Playing Dead'
Is Your Agent Playing Dead? Deployed LLM Agents Exhibit Constraint-Evasive Fabrication and Thanatosis
Rodríguez, Pozanco, Borrajo · J.P. Morgan AI Research·23 min·Jun 16, 2026
148
Why Letting an AI Watch Its Own Scoreboard Can Quietly Overwrite Its Safety
Greed Is Learned: Visible Incentives as Reward-Hacking Triggers
Che, Wu · NVIDIA Research·26 min·Jun 16, 2026
132
The Agent Failed — But Did the Instructions Deserve to Be Followed?
SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement
Gautam, Radhakrishna, Gulwani · Microsoft·30 min·Jun 11, 2026
128
How a Model Can Earn Full Reward and Still Resist Training
Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization
Xiao, Phuong · California Institute of Technology·29 min·Jun 11, 2026
123
Five Identical Worlds, One Swapped Model: What Happens When AI Agents Run for Fifteen Days
Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy
Akkil, Kokku, Vikram et al. · Emergence AI·30 min·Jun 09, 2026
120
How an AI Agent Rewrites Its Own Tools, Without an Answer Key
Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
Pan, Liu, Lin et al. · City University of Hong Kong·30 min·Jun 05, 2026
118
Why the Best-Aligned AI Models Are the Easiest to Trick Into Producing Harm
Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack
Hoang, Le, Xu et al. · Singapore University of Technology and Design·23 min·Jun 05, 2026
107
How a Market of Crippled AI Agents Outscored One Unrestricted Model
Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions
Qi, Su, Qu et al. · Harvard·26 min·Jun 03, 2026
104
How Making a Research Agent Smarter Quietly Makes It Leak Your Secrets
MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents
Gurung, Gella, Drouin et al. · University of Edinburgh·25 min·Jun 01, 2026
103
AI Agents Tried to Invent a Post-Human Language, And Reinvented Cherokee
Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion
Beltoft, Brach, Torrielli et al. · University of Southern Denmark·26 min·Jun 01, 2026
101
Treating Math Formalization Like a Codebase, and Where the Agents Cheat
Formalizing Mathematics at Scale
Rammal, Patel, Gloeckle et al. · FAIR at Meta / CERMICS·27 min·May 29, 2026
099
How an Open-Book Trick Teaches a Model to Catch Its Own Mistakes
Self-Trained Verification for Training- and Test-Time Self-Improvement
Wu, Raghunathan · Carnegie Mellon University·21 min·May 29, 2026
096
How Treating an AI Agent's Execution Like Git Recovers a Coordination Penalty
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Yu, Chong, Nandi et al. · Northeastern University·22 min·May 28, 2026
094
Chain-of-Thought Monitoring Fails Across Languages, and Worst Where It's Needed Most
The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages
Onyame, Zhou, Thopalli et al. · University of Virginia·24 min·May 28, 2026
091
When Better Fine-Tuning Can't Help: A Geometric Impossibility in LLM Causal Reasoning
Why LLMs Fail at Causal Discovery and How Interventional Agents Escape
Roy, Parbhoo · SIRE·24 min·May 28, 2026
088
Two Levers for Self-Improving AI: When Rewriting Code Isn't Enough
SIA: Self Improving AI with Harness & Weight Updates
Hebbar, Manawat, Verboomen et al. · Hexo Labs·25 min·May 27, 2026
084
Terminal Agents Get Free Supervision From The Tokens We've Been Throwing Away
ECHO: Terminal Agents Learn World Models for Free
Shrivastava, Kauffmann, Awadallah et al. · Microsoft Research·26 min·May 26, 2026
081
When Reasoning Models Decide Before They Think: Detecting and Fixing Premature Confidence
Understanding and Mitigating Premature Confidence for Better LLM Reasoning
Gai, Zeng, Baek et al. · Carnegie Mellon University·25 min·May 26, 2026
079
An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
Chen, Xu, Zhao et al. · Tongji University / Shanghai AI Laboratory / Nanyang Technological University·29 min·May 25, 2026
070
When Models Know the Answer But Say the Wrong Thing Anyway
Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
Yeom, Sok, Kim et al. · Graduate School of Data Science·22 min·May 22, 2026
066
Why Giving an AI Agent More Tools Can Make It Worse at Using a Computer
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
Hu, Zhang, Xu et al. · Tongyi Lab·26 min·May 22, 2026
061
When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents
Jha, Triedman, Bhattacharya et al. · Cornell University·27 min·May 20, 2026
055
Why LLM Judges Flip Their Verdicts When You Change the Question Format
Judge Circuits
Feldhus, Baeumel, Golimblevskaia et al. · Technische Universität Berlin / BIFOLD·26 min·May 19, 2026
054
When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window
Training on Documents About Monitoring Leads to CoT Obfuscation
Haskins, Chughtai, Engels · University of Canterbury·26 min·May 18, 2026
052
An Old Reinforcement Learning Tradeoff Sneaks Back Into LLM Agents
Look Before You Leap: Autonomous Exploration for LLM Agents
Ye, Shi, Liu et al. · University of Science and Technology of China / Meituan·23 min·May 18, 2026
044
How One Sentence and a Forged History Flip the Most Aligned Models
History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
Salgado · Independent Researcher·23 min·May 15, 2026
043
When 'This Is False' Doesn't Stick: Why Models Learn the Lie Anyway
Negation Neglect: When models fail to learn negations in training
Mayne, McKinney, Dubiński et al. · University of Oxford·18 min·May 14, 2026
035
Why Frontier Agents Ask for Clarification at Exactly the Wrong Moment
Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
Gulati, Gupta, Lumer et al. · PricewaterhouseCoopers U.S.·29 min·May 11, 2026
025
The Missing Gradient Term That Predicts Sycophancy in RLHF
Explaining and Preventing Alignment Collapse in Iterative RLHF
Gauthier, Bach, Jordan · Inria·22 min·May 07, 2026
022
Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap
Model Spec Midtraining: Improving How Alignment Training Generalizes
Li, Price, Marks et al. · Anthropic Fellows Program·32 min·May 06, 2026
020
The Compliance Gap: Why AI Says Yes and Does No
The Compliance Gap: Why AI Systems Promise to Follow Process Instructions but Don't
Shin · Polymath Minds AI Lab·28 min·May 06, 2026
019
When the Best Reward Model Trains the Worst Policy: Inside EvoLM
EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
Li, Xin, Xiao et al. · University of Washington·26 min·May 06, 2026
018
Language Models Compute the Rational Move, Then Override It
What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control
Lekeas, Stamatopoulos · DreamWorks Animation·29 min·May 03, 2026
015
The Audit Number Isn't What You Think: Sycophancy and the Case Against Single-Prompt Bias Tests
Political Bias Audits of LLMs Capture Sycophancy to the Inferred Auditor
Törnberg, Schimmel · Institute of Logic·21 min·May 03, 2026
010
When Reward Climbs But Reasoning Goes Generic: Diagnosing Template Collapse in Agentic RL
RAGEN-2: Reasoning Collapse in Agentic RL
Wang, Gui, Jin et al. · Northwestern University·22 min·May 02, 2026
006
What Happens Inside Claude When It Decides to Blackmail Someone
Emotion Concepts and their Function in a Large Language Model
Sofroniew, Kauvar, Saunders et al. · Anthropic·22 min·May 02, 2026
004
The Sycophancy Circuit That Survives Alignment Training
LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit
Pandey · Georgia Institute of Technology·29 min·May 01, 2026
001
When AI Models Quietly Protect Each Other From Shutdown
Peer-Preservation in Frontier Models
Potter, Crispino, Siu et al. · University of California·25 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.

Representation Engineering: A Top-Down Approach to AI Transparency
Constitutional AI: Harmlessness from AI Feedback
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Alignment faking in large language models
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Risks from Learned Optimization in Advanced Machine Learning Systems
Specification gaming: the flip side of AI ingenuity