Glossary · Term

CISPO

Definition

A reinforcement-learning recipe that lets a model aggressively cut back bad behaviors while limiting how fast it doubles down on good ones.

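CISPO (Clipped IS-weight Policy Optimization) is MiniMax's policy-gradient objective for agent RL. It clips the importance-sampling weight rather than the token update, and does so asymmetrically: negative-advantage actions can be strongly down-weighted, while upward updates on positive-advantage actions are constrained, which stabilizes long-horizon training.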
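
As a rough illustration of the definition above, here is a minimal sketch. It assumes the objective is a REINFORCE-style token loss in which each token's log-probability is weighted by its advantage and by a clipped importance-sampling ratio that is treated as a constant (no gradient flows through it). The function names, the sample numbers, and the `eps_low` / `eps_high` values are hypothetical and are not MiniMax's settings.

```python
import math

def clipped_is_weight(logp_new, logp_old, eps_low, eps_high):
    """Importance ratio pi_new / pi_old, clipped to [1 - eps_low, 1 + eps_high]."""
    ratio = math.exp(logp_new - logp_old)
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)

def cispo_style_loss(logp_new, logp_old, advantages, eps_low=1.0, eps_high=0.2):
    """Return the token-averaged loss and its gradient w.r.t. each new log-prob.

    The clipped weight is held constant (a stop-gradient), so each token's
    gradient is -weight * advantage / n. Clipping shrinks a token's weight
    but does not zero out its gradient.
    eps_low and eps_high are illustrative values (assumption).
    """
    n = len(logp_new)
    weights = [
        clipped_is_weight(new, old, eps_low, eps_high)
        for new, old in zip(logp_new, logp_old)
    ]
    loss = -sum(w * a * lp for w, a, lp in zip(weights, advantages, logp_new)) / n
    grads = [-w * a / n for w, a in zip(weights, advantages)]
    return loss, grads

# Two tokens: the first became twice as likely under the new policy,
# the second half as likely (hypothetical probabilities).
logp_old = [math.log(0.2), math.log(0.5)]
logp_new = [math.log(0.4), math.log(0.25)]
advantages = [1.0, -1.0]

loss, grads = cispo_style_loss(logp_new, logp_old, advantages)
```

In this sketch the first token's ratio of 2.0 is capped at 1 + eps_high, which limits how hard a single update can push it, while the second token keeps its unclipped ratio of 0.5 and still contributes a gradient. In an autodiff framework the same effect would come from detaching the clipped ratio before multiplying it into the loss.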

Mentioned in 1 episode

  1. 090
    How MiniMax-M2 Bets That Sparsity Plus Verifiable Rewards Can Match Frontier Agents