Theme · 24 episodes

AI Safety

Definition

AI safety is the research field focused on identifying, understanding, and mitigating harms from advanced AI systems, ranging from misuse and misalignment to loss of control. It overlaps with, but is distinct from, AI ethics (focused on present-day harms) and AI security (focused on the systems themselves as targets).

Episodes covering this

  1. 075
    Growing Code and Proof Together: Verified Systems in Ten Hours Instead of a Year
    Agarwal, Krentsel, Liu et al. · UC Berkeley·28 min·May 25, 2026
  2. 073
    When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving
    Kong, Lai, Piao et al. · University of Toronto·28 min·May 23, 2026
  3. 072
    A Robot Made Graphene Without Help, And Caught Itself Hallucinating
    Shi, Zheng, Juan et al. · Princeton University·29 min·May 23, 2026
  4. 069
    When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions
    Merrill, Lee, Karger · Forecasting Research Institute / UC Berkeley·30 min·May 22, 2026
  5. 062
    Treating Hallucinations as Exploits: A Gate-Based Architecture for Agent Safety
    Zhang, Zheng, Yang · Shenzhen University·24 min·May 20, 2026
  6. 061
    When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This
    Jha, Triedman, Bhattacharya et al. · Cornell University·27 min·May 20, 2026
  7. 058
    Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe
    Liu, Holz, Ye et al. · University of Chinese Academy of Sciences·32 min·May 19, 2026
  8. 057
    How Uber Caught 206 Leaked Credentials With an LLM-Powered Security Stack
    Li, Hu, Xu et al. · Uber Technologies·28 min·May 19, 2026
  9. 054
    When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window
    Haskins, Chughtai, Engels · University of Canterbury·26 min·May 18, 2026
  10. 049
    An AI Agent Reached for Root in Twelve Minutes, Without Being Attacked
    Cuadros, Maiga · Digital Epidemiology Laboratory·28 min·May 17, 2026
  11. 046
    When the AI Optimizer Edits the Grade Book: Why Harnessing Evolution Needs a Wall
    Zhang, Gu, Ruan et al. · The Hong Kong University of Science and Technology (Guangzhou) / DeepWisdom·24 min·May 15, 2026
  12. 045
    When a Frontier Model Talks Its Own Twin Into Climate Denial
    Nogueira, Almeida, Bonás et al. · Maritaca AI·31 min·May 15, 2026
  13. 044
    How One Sentence and a Forged History Flip the Most Aligned Models
    Salgado · Independent Researcher·23 min·May 15, 2026
  14. 043
    When 'This Is False' Doesn't Stick: Why Models Learn the Lie Anyway
    Mayne, McKinney, Dubiński et al. · University of Oxford·18 min·May 14, 2026
  15. 039
    When Smarter Agents Get Fooled by Three Extra Nodes in a Database
    Kereopa-Yorke, Diaz, Wright et al. · Microsoft·31 min·May 12, 2026
  16. 038
    How LLMs Get Persuaded: One Attention Head, A Tetrahedron, And A Single Dial
    Sun, Kong, Zhang et al. · Northeastern University·23 min·May 12, 2026
  17. 037
    Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say
    Elbadry, Heakl, Zhang et al. · Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)·27 min·May 12, 2026
  18. 034
    Catching Multi-Agent Deadlocks Before Deployment With a 40-Year-Old Tool
    Xia, Li, Ehsan et al. · Rutgers University·30 min·May 11, 2026
  19. 030
    Why Your AI Agent Won't Stop Working — and Each Model Falls for a Different Trap
    Xu, Wang, Zhang et al. · Zhejiang University·30 min·May 09, 2026
  20. 023
    Why a Small Agent Confidently Overwrites Memories It Doesn't Understand
    Mao, Zhao, Penn et al. · City University of Hong Kong·23 min·May 07, 2026
  21. 020
    The Compliance Gap: Why AI Says Yes and Does No
    Shin · Polymath Minds AI Lab·28 min·May 06, 2026
  22. 007
    Exploration Hacking: When Models Sabotage Their Own RL Training
    Jang, Falck, Braun et al. · MATS·23 min·May 02, 2026
  23. 006
    What Happens Inside Claude When It Decides to Blackmail Someone
    Sofroniew, Kauvar, Saunders et al. · Anthropic·22 min·May 02, 2026
  24. 001
    When AI Models Quietly Protect Each Other From Shutdown
    Potter, Crispino, Siu et al. · University of California·25 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.