Definition
A classic reinforcement-learning rule where you push the model toward whatever it did when it won, and away from what it did when it lost.
A Monte Carlo policy-gradient algorithm that updates parameters in the direction of log-probability of sampled actions weighted by return.