Definition
A reinforcement-learning recipe that lets a model aggressively cut back bad behaviors while limiting how fast it doubles down on good ones.
MiniMax's asymmetric-clipping policy gradient objective for agent RL, permitting strong down-weighting of negative-advantage actions while constraining upward updates on positive-advantage actions to stabilize long-horizon training.