Definition
Policy gradient methods train a policy directly by following the gradient of expected reward with respect to its parameters, rather than learning a value function and acting greedy. REINFORCE, PPO, and GRPO are all variants tuned for different variance/bias trade-offs.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.