Definition
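KL (Kullback-Leibler) divergence measures how much one probability distribution P differs from a reference distribution Q. It is asymmetric: the divergence of P from Q generally differs from the divergence of Q from P, so it is not a true distance. Its value is the expected extra surprise from modeling P with Q, expressed in nats (natural logarithm) or bits (base-2 logarithm). It is a foundational tool in ML, used in everything from VAEs to RLHF, where it keeps a fine-tuned policy from drifting too far from a reference policy.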
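For discrete distributions, the definition can be written as D_KL(P || Q) = sum over i of p_i * log(p_i / q_i). The sketch below is a minimal illustration of that formula, not a reference implementation. The function name `kl_divergence`, the `base` parameter, and the example distributions are assumptions chosen for illustration.

```python
import math


def kl_divergence(p, q, base=math.e):
    """Return D_KL(P || Q) for two discrete distributions given as lists.

    Assumes p and q are the same length and each sums to 1.
    base=math.e gives nats; base=2 gives bits.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # convention: a term with p_i = 0 contributes 0
        if q_i == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total / math.log(base)  # convert from nats to the requested base


# Illustrative (made-up) distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)          # D_KL(P || Q) in nats
reverse = kl_divergence(q, p)          # D_KL(Q || P) in nats
forward_bits = kl_divergence(p, q, 2)  # same quantity in bits

print(forward, reverse, forward_bits)
```

Swapping the arguments gives a different value, which is the asymmetry described above, and the divergence of a distribution from itself is zero.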