Definition
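Gradient accumulation emulates a larger batch size by summing gradients over several smaller forward/backward passes (micro-batches) before each optimizer step. The effective batch size is the micro-batch size times the number of accumulation steps, and each micro-batch loss is usually divided by that number so the accumulated gradient matches the average over the full batch. It's the standard technique for running large-batch training on memory-limited hardware, since only one micro-batch's activations need to be held in memory at a time.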
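
To make the definition concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than part of any particular framework: a toy one-parameter model y = w * x, a mean-squared-error loss, and the hypothetical helper names mse_grad, accumulated_grad, and sgd_step. It compares a single full-batch gradient with one accumulated over equal-sized micro-batches.

```python
# Toy setup (assumed for illustration): model y = w * x, mean-squared-error loss.

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Same gradient, computed one micro-batch at a time and accumulated."""
    # This sketch assumes the batch splits into equal-sized micro-batches.
    assert len(xs) % micro_batch_size == 0
    accum_steps = len(xs) // micro_batch_size
    grad = 0.0
    for i in range(accum_steps):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Scale by 1/accum_steps so the sum of micro-batch means
        # equals the mean over the full batch.
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return grad


def sgd_step(w, grad, lr):
    """One plain SGD update."""
    return w - lr * grad


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # data generated with true w = 2
w, lr = 0.0, 0.01

full = mse_grad(w, xs, ys)                               # one pass, batch of 8
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)  # four passes of 2

# The optimizer steps once per accumulated gradient, not once per micro-batch.
w_full = sgd_step(w, full, lr)
w_accum = sgd_step(w, accum, lr)
```

Under these assumptions the two gradients agree up to floating-point rounding, so both updates land on the same weight. The equivalence relies on equal-sized micro-batches and a loss that averages over examples; layers that compute per-batch statistics, such as batch normalization, would still see only the micro-batch.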