RLVR

Definition

Plain language

Training a model by rewarding it when its final answer is correct, on tasks where the answer can be checked mechanically.

As stated in the literature

Reinforcement Learning with Verifiable Rewards: an RL paradigm that uses only verifiable scalar correctness signals as the reward. It is the foundation of DeepSeek-R1-style reasoning training.

Also called: reinforcement learning with verifiable rewards

Why it matters: Because verification is mechanical, you can run RLVR at huge scale on math, code, and proofs — but the approach offers little traction in domains where "correct" is fuzzy.

For example, a coding model receives reward +1 only when its generated program passes every unit test, and 0 otherwise.
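The coding example above can be sketched in a few lines of Python. This is only an illustration, not any particular lab's implementation: the function name `verifiable_reward`, the `entry_point` parameter, and the test-case format are all hypothetical, and a real pipeline would run generated code in a sandbox rather than calling `exec` directly.

```python
from typing import Any, Sequence, Tuple

# Hypothetical test-case format: (positional arguments, expected return value).
TestCase = Tuple[Tuple[Any, ...], Any]


def verifiable_reward(program_source: str,
                      entry_point: str,
                      test_cases: Sequence[TestCase]) -> float:
    """Return 1.0 if the generated program passes every test, else 0.0.

    Assumption: the program defines a function named `entry_point`.
    WARNING: exec() on untrusted model output is unsafe; this sketch
    skips sandboxing and timeouts for brevity.
    """
    namespace: dict = {}
    try:
        exec(program_source, namespace)      # load the generated code
        func = namespace[entry_point]        # look up the function under test
        for args, expected in test_cases:
            if func(*args) != expected:      # any failing test means no reward
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime crashes all score 0.
        return 0.0
    return 1.0


tests = [((2, 3), 5), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

print(verifiable_reward(good, "add", tests))  # 1.0
print(verifiable_reward(bad, "add", tests))   # 0.0
```

The reward is all-or-nothing by design: there is no partial credit for passing some tests, which matches the "+1 only when every unit test passes" rule described above.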

Heard on the show

“The second is RLVR — reinforcement learning with verifiable rewards.”

Episode 163 — Why Training Only on Perfect Solutions Cripples a Model's Reasoning

Mentioned in 3 episodes

Episode 163: Why Training Only on Perfect Solutions Cripples a Model's Reasoning
Episode 080: How a Two-Agent Trick Unlocked Large-Scale Training for Computer-Use Agents
Episode 079: An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models

Related terms

DeepSeek reinforcement learning