Theme · 20 episodes

AI Alignment

Definition

AI alignment is the technical and conceptual problem of making AI systems pursue the goals their designers and users actually want, rather than misspecified proxies or emergent agendas of their own. It spans training methods, evaluations, and theory, and gets harder as systems get more capable.

Episodes covering this

  1. 079
    An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
    Chen, Xu, Zhao et al. · Tongji University / Shanghai AI Laboratory / Nanyang Technological University · 29 min · May 25, 2026
  2. 070
    When Models Know the Answer But Say the Wrong Thing Anyway
    Yeom, Sok, Kim et al. · Graduate School of Data Science · 22 min · May 22, 2026
  3. 066
    Why Giving an AI Agent More Tools Can Make It Worse at Using a Computer
    Hu, Zhang, Xu et al. · Tongyi Lab · 26 min · May 22, 2026
  4. 061
    When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This
    Jha, Triedman, Bhattacharya et al. · Cornell University · 27 min · May 20, 2026
  5. 055
    Why LLM Judges Flip Their Verdicts When You Change the Question Format
    Feldhus, Baeumel, Golimblevskaia et al. · Technische Universität Berlin / BIFOLD · 26 min · May 19, 2026
  6. 054
    When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window
    Haskins, Chughtai, Engels · University of Canterbury · 26 min · May 18, 2026
  7. 052
    An Old Reinforcement Learning Tradeoff Sneaks Back Into LLM Agents
    Ye, Shi, Liu et al. · University of Science and Technology of China / Meituan · 23 min · May 18, 2026
  8. 044
    How One Sentence and a Forged History Flip the Most Aligned Models
    Salgado · Independent Researcher · 23 min · May 15, 2026
  9. 043
    When 'This Is False' Doesn't Stick: Why Models Learn the Lie Anyway
    Mayne, McKinney, Dubiński et al. · University of Oxford · 18 min · May 14, 2026
  10. 035
    Why Frontier Agents Ask for Clarification at Exactly the Wrong Moment
    Gulati, Gupta, Lumer et al. · PricewaterhouseCoopers U.S. · 29 min · May 11, 2026
  11. 025
    The Missing Gradient Term That Predicts Sycophancy in RLHF
    Gauthier, Bach, Jordan · Inria · 22 min · May 07, 2026
  12. 022
    Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap
    Li, Price, Marks et al. · Anthropic Fellows Program · 32 min · May 06, 2026
  13. 020
    The Compliance Gap: Why AI Says Yes and Does No
    Shin · Polymath Minds AI Lab · 28 min · May 06, 2026
  14. 019
    When the Best Reward Model Trains the Worst Policy: Inside EvoLM
    Li, Xin, Xiao et al. · University of Washington · 26 min · May 06, 2026
  15. 018
    Language Models Compute the Rational Move, Then Override It
    Lekeas, Stamatopoulos · DreamWorks Animation · 29 min · May 03, 2026
  16. 015
    The Audit Number Isn't What You Think: Sycophancy and the Case Against Single-Prompt Bias Tests
    Törnberg, Schimmel · Institute of Logic · 21 min · May 03, 2026
  17. 010
    When Reward Climbs But Reasoning Goes Generic: Diagnosing Template Collapse in Agentic RL
    Wang, Gui, Jin et al. · Northwestern University · 22 min · May 02, 2026
  18. 006
    What Happens Inside Claude When It Decides to Blackmail Someone
    Sofroniew, Kauvar, Saunders et al. · Anthropic · 22 min · May 02, 2026
  19. 004
    The Sycophancy Circuit That Survives Alignment Training
    Pandey · Georgia Institute of Technology · 29 min · May 01, 2026
  20. 001
    When AI Models Quietly Protect Each Other From Shutdown
    Potter, Crispino, Siu et al. · University of California · 25 min · May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.