Concept · 1 episode

Gradient Accumulation

Definition

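Gradient accumulation simulates a larger batch size by summing gradients over several smaller forward/backward passes (micro-batches) before each optimizer step. Each micro-batch loss is usually divided by the number of accumulation steps so that the summed gradient matches the mean gradient of the large batch. It is the standard trick for fitting big-batch training onto small-memory hardware: peak memory is set by the micro-batch, at the cost of running the passes sequentially.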
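To make the definition concrete, here is a minimal sketch in plain Python (no framework) that checks accumulated micro-batch gradients against the full-batch gradient. The toy model `y_hat = w * x`, the mean-squared-error loss, and the names `grad_mse` and `accumulated_grad` are illustrative assumptions, not taken from the episode or from any library.

```python
# Toy setup (assumed for illustration): model y_hat = w * x, loss = mean squared error.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1/num_micro_batches.

    Assumes every micro-batch has the same size, so the mean of the
    micro-batch means equals the mean over the full batch.
    """
    assert len(xs) % micro_batch_size == 0, "equal-sized micro-batches assumed"
    num_micro = len(xs) // micro_batch_size
    total = 0.0
    for start in range(0, len(xs), micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Scale before accumulating, as if dividing the loss by num_micro.
        total += grad_mse(w, mx, my) / num_micro
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x + 1.0 for x in xs]
w = 0.5
lr = 0.01

full = grad_mse(w, xs, ys)                 # one pass over all 8 examples
accum = accumulated_grad(w, xs, ys, 2)     # four passes of 2 examples each

# A single optimizer step is taken only after all micro-batches are processed.
w_after_full = w - lr * full
w_after_accum = w - lr * accum
```

The two gradients agree up to floating-point rounding here because every micro-batch has the same number of examples. When micro-batches contribute unequal numbers of loss terms (for example, differing token counts), weighting each micro-batch equally no longer reproduces the full-batch mean, so the normalization has to be handled explicitly.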

Episodes covering this

009
How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers
SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
Limozin, Durech, Hoefler et al. · ETH AI Center · 23 min · May 02, 2026