Definition
A reinforcement-learning method that compares several attempts at the same task to figure out which ones to reinforce.
Decoupled Clip and Dynamic Sampling Policy Optimization, a GRPO-family RL algorithm used as the optimizer in MaR-style metacognitive reward training.
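The "compare several attempts at the same task" idea can be illustrated with the group-relative advantage used across GRPO-family methods: each attempt's reward is measured against the mean of its group and scaled by the group's spread. The following is a minimal sketch under stated assumptions, not the algorithm's full objective. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are all illustrative choices, and clipping and sampling rules are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each attempt relative to the other attempts at the same task.

    rewards: one scalar reward per sampled attempt (a "group").
    eps: small constant (an assumption here) that avoids division by zero
         when every attempt in the group receives the same reward.
    """
    mu = mean(rewards)       # group baseline
    sigma = pstdev(rewards)  # group spread (population std, an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four attempts at one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Attempts above the group mean get a positive advantage and are reinforced;
# attempts below the mean get a negative advantage and are discouraged.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

When every attempt in a group earns the same reward, all advantages are zero, so that group gives no signal about which attempts to reinforce.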