Definition
Letting a training job stash bookkeeping data in regular system memory because the GPU's memory isn't big enough.
A DeepSpeed feature that moves optimizer state and gradients from GPU HBM to host CPU RAM so that larger models fit on memory-constrained hardware. Its interaction with gradient accumulation is the source of a silent bug documented in the SFT-then-RL paper.
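As a rough illustration of where this feature is switched on, the sketch below builds a DeepSpeed-style configuration as a Python dictionary with optimizer offload and gradient accumulation enabled together. The key names follow DeepSpeed's JSON config schema as commonly documented; the numeric values and the variable name `ds_config` are assumptions chosen for illustration, and nothing here reproduces or diagnoses the bug mentioned above.

```python
import json

# Illustrative values only; the batch size and accumulation steps are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    # Gradients from several micro-batches are summed before each optimizer step.
    "gradient_accumulation_steps": 8,
    "zero_optimization": {
        "stage": 2,
        # Keep optimizer state in host CPU RAM instead of GPU HBM.
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
        },
    },
}

# DeepSpeed accepts its config as JSON, so serialize the dictionary for inspection.
print(json.dumps(ds_config, indent=2))
```

Because offload and accumulation are configured in the same file, a setup like this is the kind of combination in which the two features meet.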