Definition
AI alignment is the technical and conceptual problem of making AI systems pursue the goals their designers and users actually intend, rather than misspecified proxies or emergent agendas of their own. It spans training methods, evaluations, and theory, and is generally expected to get harder as systems become more capable.
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.
- Representation Engineering: A Top-Down Approach to AI Transparency
- Constitutional AI: Harmlessness from AI Feedback
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- Alignment faking in large language models
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
- Risks from Learned Optimization in Advanced Machine Learning Systems