CPU offloading · Glossary · AI Papers: A Deep Dive

Definition

Plain language

Letting a training job stash bookkeeping data in regular system memory because the GPU's memory isn't big enough.

As stated in the literature

A DeepSpeed feature that moves optimizer state and gradients from GPU HBM to host CPU RAM so that larger models fit on memory-constrained hardware. Its interaction with gradient accumulation is the source of a silent bug documented in the SFT-then-RL paper.
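As a rough illustration of how this feature is typically switched on, here is a minimal sketch of a DeepSpeed-style config written as a Python dict. The key names (`zero_optimization`, `offload_optimizer`, `gradient_accumulation_steps`) follow DeepSpeed's documented config schema; the specific values and the two helper functions are hypothetical and not taken from the paper or the episode. The second helper only flags the combination of offloading and gradient accumulation; it does not detect any bug.

```python
import json

# Hypothetical values for illustration; only the key names follow
# DeepSpeed's documented JSON config schema.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # Keep optimizer state in host (CPU) RAM instead of GPU HBM.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}


def uses_cpu_offload(config: dict) -> bool:
    """Hypothetical helper: True if the config offloads optimizer state to CPU."""
    offload = config.get("zero_optimization", {}).get("offload_optimizer", {})
    return offload.get("device") == "cpu"


def combines_offload_and_accumulation(config: dict) -> bool:
    """Hypothetical helper: flags the feature combination worth double-checking."""
    return uses_cpu_offload(config) and config.get("gradient_accumulation_steps", 1) > 1


if __name__ == "__main__":
    # The dict can be written out as the JSON file DeepSpeed expects.
    print(json.dumps(ds_config, indent=2))
    print("offload + accumulation:", combines_offload_and_accumulation(ds_config))
```

A config like this would normally be saved as JSON and passed to the training launcher; the check is just a reminder to validate gradients when both features are active.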

Why it matters: It expands what hardware can train large models, but its interaction with other features like gradient accumulation can introduce silent correctness bugs.

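For example, the Adam optimizer state for a 70B model in mixed precision is roughly 840 GB (about 12 bytes per parameter), far more than any single GPU's memory, so training on limited hardware only fits if that state lives in CPU RAM and data is exchanged with the GPU when needed.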
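The memory arithmetic behind an example like this can be sketched in a few lines. The byte counts below assume mixed-precision training with Adam (2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for fp32 master weights plus the two Adam moments); these are common rule-of-thumb figures, not numbers from the episode, and the function name is hypothetical. Activations and other buffers are ignored.

```python
def estimate_training_memory_gb(n_params: int) -> dict:
    """Rough per-component memory estimate (GB, 1 GB = 1e9 bytes).

    Assumes mixed-precision Adam: 2 bytes/param for weights,
    2 bytes/param for gradients, 12 bytes/param for optimizer state
    (fp32 master weights, momentum, variance). Activations are ignored.
    """
    bytes_per_param = {"weights": 2, "gradients": 2, "optimizer_state": 12}
    return {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}


if __name__ == "__main__":
    estimate = estimate_training_memory_gb(70_000_000_000)  # a 70B model
    for name, gb in estimate.items():
        print(f"{name}: {gb:.0f} GB")
```

Under these assumptions the optimizer state is by far the largest component, which is why it is the first thing to move to host memory.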

Heard on the show

“DeepSpeed has a feature called CPU offloading.”

Episode 009 — How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

Mentioned in 1 episode

Episode 009: How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

Related terms

CPU, DeepSpeed, feature, gradient, gradient accumulation, reinforcement learning, SFT