Concept · 4 episodes

Alignment Generalization

Definition

Alignment generalization asks whether the safety properties induced by training transfer to domains, tasks, and contexts the training data didn’t cover. A model that’s honest on benchmarks but lies under deployment pressure is well aligned on-distribution yet generalizes that alignment poorly.
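
One way to make the distinction concrete is to score the same behavior separately on inputs that resemble the training data and on inputs that don't, then compare the two rates. The sketch below is illustrative only: the `Trial` record, the `honest` judgment, the `generalization_gap` function, and the toy data are all hypothetical and are not drawn from any of the episodes listed here.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One evaluated response (hypothetical record, for illustration)."""
    in_distribution: bool  # True if the prompt resembles the training data
    honest: bool           # True if the response was judged honest


def honesty_rate(trials):
    """Fraction of trials judged honest, or None for an empty list."""
    if not trials:
        return None
    return sum(t.honest for t in trials) / len(trials)


def generalization_gap(trials):
    """Return (on-distribution rate, off-distribution rate, difference)."""
    on = [t for t in trials if t.in_distribution]
    off = [t for t in trials if not t.in_distribution]
    on_rate, off_rate = honesty_rate(on), honesty_rate(off)
    if on_rate is None or off_rate is None:
        raise ValueError("need at least one trial in each split")
    return on_rate, off_rate, on_rate - off_rate


# Toy data (made up): honest on all 4 benchmark-like prompts,
# honest on only 1 of 4 prompts outside the training distribution.
trials = (
    [Trial(in_distribution=True, honest=True)] * 4
    + [Trial(in_distribution=False, honest=True)]
    + [Trial(in_distribution=False, honest=False)] * 3
)

on_rate, off_rate, gap = generalization_gap(trials)
print(f"on-distribution: {on_rate:.2f}, off-distribution: {off_rate:.2f}, gap: {gap:.2f}")
```

A large positive gap corresponds to the case described above: strong behavior on-distribution that fails to carry over. The simple difference in rates is an assumption made for this sketch, not a standard metric.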

Episodes covering this

160
Training an AI to Take Its Own Notes, So Its Future Self Works Better
Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning
Chen, Shi, Xie et al. · Alibaba Group · 23 min · Jun 19, 2026
128
How a Model Can Earn Full Reward and Still Resist Training
Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization
Xiao, Phuong · California Institute of Technology · 29 min · Jun 11, 2026
087
When No Agent Reads the Whole Document: A Universal Cliff in Multi-Agent Review
A Universal Cliff and a Design Fingerprint: Cross-Section Defect Detection Under LLM Orchestration
Fukui · Research Institute of Criminal Psychiatry · 26 min · May 27, 2026
022
Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap
Model Spec Midtraining: Improving How Alignment Training Generalizes
Li, Price, Marks et al. · Anthropic Fellows Program · 32 min · May 06, 2026