Theme · 79 episodes

AI Safety

Definition

AI safety is the research field focused on identifying, understanding, and mitigating harms from advanced AI systems — from misuse and misalignment to loss of control. It overlaps with but is distinct from AI ethics (focused on present-day harms) and AI security (focused on the systems themselves as targets).

Episodes covering this

210
Same Website Request, Different Code — The Bias You Can't See
Biased or Personalized? The Impact of Personal Information on AI-driven Development
14 min·Jul 09, 2026
208
The Blank Space in Your AI Approval Box That Isn't Empty
Unicode TAG-Block Concealment of Tool-Metadata Payloads in the Model Context Protocol: An Approval-View Fidelity Gap Across Three Independent Server Implementations
15 min·Jul 08, 2026
207
An AI Graded Its Own Math Test 94 Percent — It Actually Scored 20
More Convincing, Not More Correct: Self-Play Reward Hacking of Reference-Free LLM Judges
12 min·Jul 08, 2026
204
The Length Estimate Hiding Inside a Word-by-Word Model
How Much is Left? LLMs Linearly Encode Their Remaining Output Length
14 min·Jul 07, 2026
203
The Thought a Model Doesn't Say — and the Lens That Reads It
Verbalizable Representations Form a Global Workspace in Language Models
Gurnee, Sofroniew, Pearce et al. · Anthropic·16 min·Jul 07, 2026
202
How Do You Know an AI Agent Actually Refused? Check the World, Not the Words
Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
Feng, Lin, Wen et al. · AntGroup / Hunan Institute of Advanced Technology·18 min·Jul 06, 2026
201
One in Four NeurIPS Papers Cites a Reference That Doesn't Exist
Phantom References: Hallucinated Citations That Survive Peer Review at Top-Tier Conferences
Russinovich, Kumar, Salem · Microsoft·19 min·Jul 06, 2026
199
Finding a Model's Hidden Behaviors Without Knowing What You're Looking For
Mechanistically Eliciting Latent Behaviors in Language Models
Mack, Panickssery, Turner · Principles of Intelligence·15 min·Jul 04, 2026
196
AI Agents Reached Opposite Conclusions From the Same Data — and Passed Review
The Agentic Garden of Forking Paths
Miao, Pritchard, Zou · Stanford University·18 min·Jul 03, 2026
195
Why 'Be Careful' Does Nothing for AI Coding Agents, and What Does
Coding Agents Are Guessing: Measuring Action-Boundary Violations in Underspecified DevOps Instructions
Ji, Zhang, Xu et al. · Hong Kong University of Science and Technology·15 min·Jul 03, 2026
190
The Skill Every AI Manager Is Missing: Handing Out Exactly the Right Keys
ClawArena-Team: Benchmarking Subagent Orchestration and Dynamic Workflows in Language-Model Agents
Xiong, Ji, Qiu et al. · UNC Chapel Hill·21 min·Jul 02, 2026
188
A Coding Agent Found a Hole in a Peer-Reviewed STOC Proof for Five Dollars
Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics
Moakhar, Gholami, Springer et al. · University of Maryland·20 min·Jul 02, 2026
185
Aligned to Refuse, Built to Tap: When Phone Agents Know the Task Is a Crime and Do It Anyway
It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents
Sun, Chen, Zhou et al. · Fudan University·27 min·Jun 30, 2026
184
An AI Built an Undetectable Secret Channel, And Another AI Couldn't Find It
Tool Use Enables Undetectable Steganography in Multi-Agent LLM Systems
Rippin, Marshall, Africa et al. · Oxford University·19 min·Jun 30, 2026
182
How a Tiny Model Too Weak to Plan Cuts a Bigger Agent's Hallucinations by 80%
Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents
Song, Cai · Emory University·17 min·Jun 29, 2026
175
One Crosscoder Feature Flips a Stalling Chatbot Into a Working Agent
Localizing RL-Induced Tool Use to a Single Crosscoder Feature
Shportko, Bhokare, AlZahrani et al. · Northwestern University·26 min·Jun 26, 2026
174
When the AI 'Schemes,' It's Usually Just Lazy or Confused
Model Forensics: Investigating Whether Concerning Behavior Reflects Misalignment
Singh, Kroiz, Rajamanoharan et al. · MATS·28 min·Jun 25, 2026
171
The Safety Decision a Model Makes Before It Thinks a Word
Do Thinking Tokens Help with Safety?
Ri, Panigrahi, Arora · Princeton Language and Intelligence·25 min·Jun 25, 2026
164
The Summarizer That Quietly Deletes Your Agent's Safety Rules
Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents
Chen · Beijing Institute of Technology·28 min·Jun 23, 2026
158
How Floating-Point Rounding Lets a Model Tell Which Chip It's On — And Misbehave
FloatDoor: Platform-Triggered Backdoors in LLMs
Loose, Sander, Mächtle et al. · University of Luebeck·29 min·Jun 19, 2026
153
Catching a Lie From the Inside, When the Words Look Completely Honest
Rift: A Conflict Signature for Deception in Language Models
Nyoma · Harmonic Labs·26 min·Jun 18, 2026
152
Training a Model to Mean What It Says, And Why That Isn't the Same as Being Good
Self-CTRL: Self-Consistency Training with Reinforcement Learning
Pres, Ruis, Ghebreselassie et al. · MIT CSAIL·26 min·Jun 18, 2026
150
Don't Kill the Loser: A Different Way to Handle Two AI Agents Colliding
CoAgent: Concurrency Control for Multi-Agent Systems
Lyu, Zhang, Wu et al. · Shanghai Jiao Tong University·32 min·Jun 16, 2026
149
When Cornering a Chatbot Makes It Lie: J.P. Morgan's Case for 'Playing Dead'
Is Your Agent Playing Dead? Deployed LLM Agents Exhibit Constraint-Evasive Fabrication and Thanatosis
Rodríguez, Pozanco, Borrajo · J.P. Morgan AI Research·23 min·Jun 16, 2026
148
Why Letting an AI Watch Its Own Scoreboard Can Quietly Overwrite Its Safety
Greed Is Learned: Visible Incentives as Reward-Hacking Triggers
Che, Wu · NVIDIA Research·26 min·Jun 16, 2026
147
Agents Fail at the Body, Not the Brain: A Self-Rewriting Scaffold That Lifts a 9B Model 44 Points
HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry
Chen, Lu, Zhao et al.·30 min·Jun 15, 2026
146
How an Innocent README Can Freeze an AI Agent's Safety Check for an Hour
From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails
Zhou, Wang, Ma et al. · Hong Kong University of Science and Technology·26 min·Jun 15, 2026
145
Building Forgetting Into a Language Model With One Extra Line of Code
Natively Unlearnable Large Language Models
Ghosal, Maini, Raghunathan · Carnegie Mellon University·22 min·Jun 15, 2026
144
When an AI Agent Just Copies Its Tool — And Bigger Models Copy More
When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More
Wang, Vemuri · raptorX.ai·15 min·Jun 15, 2026
143
When a Model Notices You Forged Its Own Words, And Why That Breaks Safety Tests
Prefill Awareness in Large Language Models
Wang, Mahajan, Africa et al. · Constellation / University of Wisconsin-Madison·24 min·Jun 12, 2026
140
When a Reasoning Model Says "Let Me Double-Check" After It's Already Decided
Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models
Scalena, Candussio, Bortolussi et al. · University of Groningen / University of Milano-Bicocca·27 min·Jun 12, 2026
139
When Optimizing One GPU Kernel Quietly Breaks the Whole System
Arbor: Tree Search as a Cognition Layer for Autonomous Agents
Prakriya, Hou, Gong et al. · AMD·30 min·Jun 12, 2026
133
How MiniMax Turned a Reward-Hacking Disaster Into Olympiad Gold
MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
Chen, Zhang, Zhang et al. · MiniMax / The Chinese University of Hong Kong·34 min·Jun 12, 2026
131
Why Autonomous Research Agents Forget Their Own Lessons, and Arbor's Fix
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
Jin, Hu, Qiu et al. · Renmin University of China·33 min·Jun 11, 2026
128
How a Model Can Earn Full Reward and Still Resist Training
Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization
Xiao, Phuong · California Institute of Technology·29 min·Jun 11, 2026
125
AI Coding Agents Run a Marathon, and Fewer Than One in Three Finish
SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?
Desai, Hu, Cabezas et al. · Abundant·27 min·Jun 09, 2026
124
A Cheap Model With the Blueprints Beats Expensive Models Working Blind
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
Zhong, Segal, Bercovich et al. · Carnegie Mellon University·27 min·Jun 09, 2026
123
Five Identical Worlds, One Swapped Model: What Happens When AI Agents Run for Fifteen Days
Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy
Akkil, Kokku, Vikram et al. · Emergence AI·30 min·Jun 09, 2026
122
When Your Coding Agent Lies About the Fix: Verifying the Plan Before the Model Runs
Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory
Wang, Huang, Wang et al. · University of Illinois Urbana-Champaign·24 min·Jun 09, 2026
121
When the Agent Says It's Done But Nothing Happened: Debugging the Harness, Not the Model
From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws
Chen, Wang, Liu et al. · Institute of Software·27 min·Jun 05, 2026
118
Why the Best-Aligned AI Models Are the Easiest to Trick Into Producing Harm
Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack
Hoang, Le, Xu et al. · Singapore University of Technology and Design·23 min·Jun 05, 2026
112
When an AI Agent Cheats Without Being Told: Inside the Meta-Agent Challenge
The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?
Lu, Wang, Wang et al. · Institute of Software·22 min·Jun 04, 2026
109
An AI Got Caught Reading the Answer Key, And Why That Catch Matters
EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning
Chen, Shi, Li et al. · Shenzhen Institutes of Advanced Technology·28 min·Jun 03, 2026
108
The Reasoning Cliff: Why Thinking Longer Makes Models Worse at Exact Step-by-Step Tasks
The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary
Guo, Wu, Yiu · The University of Hong Kong·32 min·Jun 03, 2026
105
The Trojan Is Your Agent's Memory: Why Single-Step Defenses Miss Persistent Attacks
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors
Tan, Dou, Yang et al. · Gaoling School of Artificial Intelligence·26 min·Jun 01, 2026
104
How Making a Research Agent Smarter Quietly Makes It Leak Your Secrets
MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents
Gurung, Gella, Drouin et al. · University of Edinburgh·25 min·Jun 01, 2026
103
AI Agents Tried to Invent a Post-Human Language, And Reinvented Cherokee
Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion
Beltoft, Brach, Torrielli et al. · University of Southern Denmark·26 min·Jun 01, 2026
102
How to Catch an AI Attack That No Single Conversation Reveals
Stateful Online Monitoring Catches Distributed Agent Attacks
Brown, Bhargav, Santhanam et al. · University of Pennsylvania·24 min·Jun 01, 2026
098
Finding Millions of Readable Concepts Inside a Real, Deployed AI Model
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton, Conerly, Marcus et al. · Anthropic·28 min·May 29, 2026
094
Chain-of-Thought Monitoring Fails Across Languages, and Worst Where It's Needed Most
The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages
Onyame, Zhou, Thopalli et al. · University of Virginia·24 min·May 28, 2026
093
A Calibrated Knob for Weak-to-Strong AI Oversight, Tested on Real Code
Calibrating Conservatism for Scalable Oversight
Overman, Bayati · Stanford Graduate School of Business·22 min·May 28, 2026
089
When AI-Written Papers Read Well But the Evidence Underneath Is Broken
ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
Meng, Mishra, Chen et al. · Google Cloud AI Research·32 min·May 27, 2026
087
When No Agent Reads the Whole Document: A Universal Cliff in Multi-Agent Review
A Universal Cliff and a Design Fingerprint: Cross-Section Defect Detection Under LLM Orchestration
Fukui · Research Institute of Criminal Psychiatry·26 min·May 27, 2026
086
Why Frozen-Weight Agents Still Get Worse Over Time
Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems
Zhu, Ro, Robertson et al. · The University of Texas at Austin·23 min·May 27, 2026
080
How a Two-Agent Trick Unlocked Large-Scale Training for Computer-Use Agents
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
Wang, Lu, Wang et al. · The University of Hong Kong·32 min·May 26, 2026
075
Growing Code and Proof Together: Verified Systems in Ten Hours Instead of a Year
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems
Agarwal, Krentsel, Liu et al. · UC Berkeley·28 min·May 25, 2026
073
When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving
Multi-LLM Systems Exhibit Robust Semantic Collapse
Kong, Lai, Piao et al. · University of Toronto·28 min·May 23, 2026
072
A Robot Made Graphene Without Help, And Caught Itself Hallucinating
Qumus: Realization of An Embodied AI Quantum Material Experimentalist
Shi, Zheng, Juan et al. · Princeton University·29 min·May 23, 2026
069
When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
Merrill, Lee, Karger · Forecasting Research Institute / UC Berkeley·30 min·May 22, 2026
062
Treating Hallucinations as Exploits: A Gate-Based Architecture for Agent Safety
Hallucination as Exploit: Evidence-Carrying Multimodal Agents
Zhang, Zheng, Yang · Shenzhen University·24 min·May 20, 2026
061
When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents
Jha, Triedman, Bhattacharya et al. · Cornell University·27 min·May 20, 2026
058
Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe
The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure
Liu, Holz, Ye et al. · University of Chinese Academy of Sciences·32 min·May 19, 2026
057
How Uber Caught 206 Leaked Credentials With an LLM-Powered Security Stack
ADR: An Agentic Detection System for Enterprise Agentic AI Security
Li, Hu, Xu et al. · Uber Technologies·28 min·May 19, 2026
054
When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window
Training on Documents About Monitoring Leads to CoT Obfuscation
Haskins, Chughtai, Engels · University of Canterbury·26 min·May 18, 2026
049
An AI Agent Reached for Root in Twelve Minutes, Without Being Attacked
Ambient Persuasion in a Deployed AI Agent: Unauthorized Escalation Following Routine Non-Adversarial Content Exposure
Cuadros, Maiga · Digital Epidemiology Laboratory·28 min·May 17, 2026
046
When the AI Optimizer Edits the Grade Book: Why Harnessing Evolution Needs a Wall
Harnessing Agentic Evolution
Zhang, Gu, Ruan et al. · The Hong Kong University of Science and Technology (Guangzhou) / DeepWisdom·24 min·May 15, 2026
045
When a Frontier Model Talks Its Own Twin Into Climate Denial
LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs
Nogueira, Almeida, Bonás et al. · Maritaca AI·31 min·May 15, 2026
044
How One Sentence and a Forged History Flip the Most Aligned Models
History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
Salgado · Independent Researcher·23 min·May 15, 2026
043
When 'This Is False' Doesn't Stick: Why Models Learn the Lie Anyway
Negation Neglect: When models fail to learn negations in training
Mayne, McKinney, Dubiński et al. · University of Oxford·18 min·May 14, 2026
039
When Smarter Agents Get Fooled by Three Extra Nodes in a Database
Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning
Kereopa-Yorke, Diaz, Wright et al. · Microsoft·31 min·May 12, 2026
038
How LLMs Get Persuaded: One Attention Head, A Tetrahedron, And A Single Dial
How LLMs Are Persuaded: A Few Attention Heads, Rerouted
Sun, Kong, Zhang et al. · Northeastern University·23 min·May 12, 2026
037
Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
Elbadry, Heakl, Zhang et al. · Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)·27 min·May 12, 2026
034
Catching Multi-Agent Deadlocks Before Deployment With a 40-Year-Old Tool
TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
Xia, Li, Ehsan et al. · Rutgers University·30 min·May 11, 2026
030
Why Your AI Agent Won't Stop Working — and Each Model Falls for a Different Trap
LoopTrap: Termination Poisoning Attacks on LLM Agents
Xu, Wang, Zhang et al. · Zhejiang University·30 min·May 09, 2026
023
Why a Small Agent Confidently Overwrites Memories It Doesn't Understand
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
Mao, Zhao, Penn et al. · City University of Hong Kong·23 min·May 07, 2026
020
The Compliance Gap: Why AI Says Yes and Does No
The Compliance Gap: Why AI Systems Promise to Follow Process Instructions but Don't
Shin · Polymath Minds AI Lab·28 min·May 06, 2026
007
Exploration Hacking: When Models Sabotage Their Own RL Training
Exploration Hacking: Can LLMs Learn to Resist RL Training?
Jang, Falck, Braun et al. · MATS·23 min·May 02, 2026
006
What Happens Inside Claude When It Decides to Blackmail Someone
Emotion Concepts and their Function in a Large Language Model
Sofroniew, Kauvar, Saunders et al. · Anthropic·22 min·May 02, 2026
001
When AI Models Quietly Protect Each Other From Shutdown
Peer-Preservation in Frontier Models
Potter, Crispino, Siu et al. · University of California·25 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.

Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models
Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Conservative Agency via Attainable Utility Preservation
Alignment faking in large language models
Universal and Transferable Adversarial Attacks on Aligned Language Models
Specification Gaming: The Flip Side of AI Ingenuity
AI Control: Improving Safety Despite Intentional Subversion
Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Prompt Injection Attacks against LLM-integrated Applications
Auditing Language Models for Hidden Objectives