Glossary · Term

alignment training

← all terms

Definition

The post-training stage that shapes a model to be helpful, harmless, and honest.

Post-training procedures including SFT and RLHF that shape model behavior toward desired norms after pretraining.

Also called: alignment

Mentioned in 19 episodes

  1. 079
    An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
  2. 075
    Growing Code and Proof Together: Verified Systems in Ten Hours Instead of a Year
  3. 073
    When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving
  4. 070
    When Models Know the Answer But Say the Wrong Thing Anyway
  5. 069
    When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions
  6. 062
    Treating Hallucinations as Exploits: A Gate-Based Architecture for Agent Safety
  7. 061
    When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This
  8. 057
    How Uber Caught 206 Leaked Credentials With an LLM-Powered Security Stack
  9. 045
    When a Frontier Model Talks Its Own Twin Into Climate Denial
  10. 044
    How One Sentence and a Forged History Flip the Most Aligned Models
  11. 043
    When 'This Is False' Doesn't Stick: Why Models Learn the Lie Anyway
  12. 038
    How LLMs Get Persuaded: One Attention Head, A Tetrahedron, And A Single Dial
  13. 035
    Why Frontier Agents Ask for Clarification at Exactly the Wrong Moment
  14. 025
    The Missing Gradient Term That Predicts Sycophancy in RLHF
  15. 022
    Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap
  16. 006
    What Happens Inside Claude When It Decides to Blackmail Someone
  17. 004
    The Sycophancy Circuit That Survives Alignment Training
  18. 002
    An AI Ran a Real Optics Lab for 21 Hours and Found a Transformer-Shaped Pattern in Light
  19. 001
    When AI Models Quietly Protect Each Other From Shutdown

Related concepts