Concept · 12 episodes

GRPO


Definition

GRPO (Group Relative Policy Optimization) is a policy-gradient method that estimates each response's advantage by comparing its reward to the average reward of a group of responses sampled for the same prompt, typically scaled by the group's standard deviation. Because the group average serves as the baseline, no separate value model is needed. It has become a popular way to do reinforcement learning from rewards in reasoning-model training pipelines.
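The group-relative advantage in the definition above can be sketched in a few lines of Python. This is a minimal illustration and not any particular library's implementation: the function name `group_relative_advantages`, the `eps` stabilizer, and the choice of population standard deviation are all assumptions made for the example, and some variants skip the division by the standard deviation entirely.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per response in a single prompt's sample group.

    Hypothetical helper for illustration; `eps` is an assumed stabilizer
    that avoids division by zero when every reward in the group is equal.
    """
    if not rewards:
        return []
    baseline = mean(rewards)   # group mean stands in for a learned value estimate
    spread = pstdev(rewards)   # population std of the group's rewards (assumption)
    return [(r - baseline) / (spread + eps) for r in rewards]


# Made-up rewards for four responses sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uniform = group_relative_advantages([0.5, 0.5, 0.5])
```

Responses that score above the group mean receive positive advantages and those below receive negative ones. When every response in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.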

Episodes covering this

  1. 079
    An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
    Chen, Xu, Zhao et al. · Tongji University / Shanghai AI Laboratory / Nanyang Technological University · 29 min · May 25, 2026
  2. 073
    When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving
    Kong, Lai, Piao et al. · University of Toronto · 28 min · May 23, 2026
  3. 066
    Why Giving an AI Agent More Tools Can Make It Worse at Using a Computer
    Hu, Zhang, Xu et al. · Tongyi Lab · 26 min · May 22, 2026
  4. 064
    When Agent Memory Stops Being a Database and Starts Being a Skill
    Ye, Liu, Wang et al. · University of Illinois Urbana-Champaign · 30 min · May 22, 2026
  5. 059
    Firefly's Inversion: Building Verified Tool-Call Training Data by Working Backward
    Lu, Wang, Lu et al. · Northeastern University · 22 min · May 20, 2026
  6. 052
    An Old Reinforcement Learning Tradeoff Sneaks Back Into LLM Agents
    Ye, Shi, Liu et al. · University of Science and Technology of China / Meituan · 23 min · May 18, 2026
  7. 051
    Why Parallel Sampling Plateaus, And What Evidence Graphs Do Instead
    Zhang, Su, Chen et al. · MiroMind AI · 22 min · May 18, 2026
  8. 048
    How a 30B Open Model Reached Olympiad Gold With the Right Recipe
    Li, Zhan, Zhang et al. · Shanghai AI Laboratory / The Chinese University of Hong Kong · 31 min · May 16, 2026
  9. 047
    When Agent Benchmarks Lie: The Harness Problem in Open-Source AI
    Peng, Yao, Wu et al. · Microsoft Research · 28 min · May 15, 2026
  10. 026
    What RL Actually Does to Language Models, at the Token Level
    Akgül, Kannan, Neiswanger et al. · University of Southern California · 24 min · May 08, 2026
  11. 011
    When RL Actually Teaches Agents Something New, And When It Doesn't
    Zhai, Yan, Shao et al. · Fudan University · 23 min · May 02, 2026
  12. 007
    Exploration Hacking: When Models Sabotage Their Own RL Training
    Jang, Falck, Braun et al. · MATS · 23 min · May 02, 2026