Definition
A reinforcement-learning method that adjusts a model's policy directly so that actions followed by higher reward become more likely.
A family of RL algorithms that estimate the gradient of expected return with respect to the policy parameters from sampled trajectories, then update those parameters by gradient ascent.
Also called: policy-gradient
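Example
A minimal sketch of the idea in the definition, assuming a toy two-action bandit in which each "trajectory" is a single action and its reward. The function names (softmax, reinforce_bandit), the reward values, the learning rate, and the step count are illustrative assumptions, not part of the definition. The update is the basic score-function (REINFORCE-style) estimator with no baseline; practical methods usually add variance reduction.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Run score-function policy-gradient updates on a toy bandit.

    All parameters are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # policy parameters: one logit per action
    for _ in range(steps):
        probs = softmax(theta)
        # Sample a one-step trajectory: a single action and its return.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        ret = rewards[action]
        # For a softmax policy:
        # d/d theta_k log pi(action) = 1[k == action] - probs[k]
        for k in range(len(theta)):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            # Gradient ascent on the sampled estimate of expected return.
            theta[k] += lr * ret * grad_log_pi
    return softmax(theta)


# Action 1 always pays 1.0 and action 0 pays 0.0 (assumed rewards).
final_probs = reinforce_bandit(rewards=[0.0, 1.0])
print(final_probs)
```

Under these assumptions the policy starts at equal probabilities and shifts most of its probability mass onto the rewarded action, which is the "higher reward becomes more likely" behavior the definition describes.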