reinforcement learning with verifiable rewards · Glossary

Definition

Plain language

Training a model by checking whether its final answer is correct, on tasks where the answer can be verified mechanically.

As stated in the literature

RLVR: an RL training paradigm that uses only verifiable scalar correctness signals (e.g., from calculators, compilers, or formal verifiers); the foundation of DeepSeek-R1-style reasoning training.

Also called: RLVR

Why it matters: Because the reward is mechanically checkable, training can scale without bottlenecking on human labelers, but it works only in domains where correctness is decidable.

For example, a math-reasoning model receives a reward of +1 whenever its final answer matches the known solution and 0 otherwise, with no human in the loop.
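
The binary reward in that example can be sketched in a few lines of Python. This is an illustrative sketch, not any particular system's implementation: the function names `normalize` and `verifiable_reward` are hypothetical, and it assumes the final answer has already been extracted from the model's output as a plain numeric string.

```python
from fractions import Fraction
from typing import Optional


def normalize(answer: str) -> Optional[Fraction]:
    """Parse a numeric answer such as '3/4' or '0.75' into an exact value.

    Returns None when the string is not a valid number. (Hypothetical helper.)
    """
    try:
        return Fraction(answer.strip())
    except (ValueError, ZeroDivisionError):
        return None


def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final answer equals the reference, else 0.0.

    Assumption: exact numeric equality is the correctness criterion.
    """
    predicted = normalize(model_answer)
    expected = normalize(reference_answer)
    if predicted is None or expected is None:
        # An unparseable answer cannot be verified, so it earns no reward.
        return 0.0
    return 1.0 if predicted == expected else 0.0
```

Here `verifiable_reward("0.75", "3/4")` returns 1.0 because both strings parse to the same exact value, while a wrong or unparseable answer returns 0.0. Comparing exact fractions rather than raw strings is one possible design choice; it keeps equivalent forms of the same number from being scored as wrong.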

Heard on the show

“The second is RLVR — reinforcement learning with verifiable rewards.”

Episode 163 — Why Training Only on Perfect Solutions Cripples a Model's Reasoning

Mentioned in 2 episodes

Related terms

DeepSeek · reinforcement learning · RLVR · verifier