
gradient accumulation

Definition

Plain language

Adding up gradients from several small batches before updating the model, as a workaround when one big batch won't fit in memory.

As stated in the literature

A training technique that processes mini-batches sequentially and sums their gradients before a single optimizer step, simulating a larger effective batch on memory-constrained hardware.

Why it matters: It lets small labs train with large effective batch sizes on modest hardware, which is often the difference between a stable training run and a noisy one.

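For example, if a batch of 256 examples won't fit in GPU memory but 32 will, you process eight mini-batches of 32, add up their gradients (scaled by 1/8 when each mini-batch loss is an average, so the total matches the full-batch gradient), and then take one optimizer step as if you'd used 256.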
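
The 256-versus-32 example can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the toy linear model, the synthetic data, and the helper names `mse_gradient` and `accumulated_gradient` are all assumptions made up for this sketch. It assumes a mean-reduced loss and equal-sized mini-batches, which is why each mini-batch gradient is scaled by 1/8 before being added.

```python
import random


def mse_gradient(w, b, xs, ys):
    """Gradient of the mean squared error of y ~ w*x + b over one batch."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


def accumulated_gradient(w, b, xs, ys, micro_batch_size):
    """Accumulate mini-batch gradients, each scaled by 1/num_batches.

    Assumes len(xs) is an exact multiple of micro_batch_size.
    """
    num_batches = len(xs) // micro_batch_size
    acc_w, acc_b = 0.0, 0.0
    for i in range(num_batches):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        gw, gb = mse_gradient(w, b, xs[lo:hi], ys[lo:hi])
        # Scale so the running total equals the average over all examples.
        acc_w += gw / num_batches
        acc_b += gb / num_batches
    return acc_w, acc_b


# Hypothetical toy data: 256 noisy points near y = 3x + 0.5.
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(256)]
ys = [3.0 * x + 0.5 + random.gauss(0, 0.1) for x in xs]

w, b = 0.0, 0.0
full = mse_gradient(w, b, xs, ys)  # one batch of 256
accum = accumulated_gradient(w, b, xs, ys, micro_batch_size=32)  # 8 x 32

# One optimizer step (plain SGD, learning rate chosen arbitrarily).
learning_rate = 0.1
w_next = w - learning_rate * accum[0]
b_next = b - learning_rate * accum[1]
```

Here `full` and `accum` agree up to floating-point rounding, so the single step taken from the accumulated gradient is the step a batch of 256 would have produced. If the scaling by `num_batches` is dropped, the accumulated gradient comes out eight times too large, which acts like an unintended eightfold increase in learning rate.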

Heard on the show

“To run a 7-billion-parameter model on academic hardware, you need tricks — and one of the standard tricks is gradient accumulation.”

Episode 009 — How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

Mentioned in 1 episode

009
How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

Related terms

gradient