Definition
Reinforcement learning post-training applies RL to an already-pretrained language model, optimizing it for some reward signal — preferences, task success, verifier scores. It’s the post-training stage that gives modern reasoning models most of their behavioral character.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.