FSDP · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A way to split a huge model across many GPUs so it fits in memory during training.

As stated in the literature

Fully Sharded Data Parallel, a distributed training scheme that shards parameters, gradients, and optimizer state across GPUs.

Why it matters: It is a widely used way to train large models on commodity GPU clusters, and because it ships with PyTorch, teams do not need a separate specialized parallelism library.

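For example, a 70B-parameter model cannot be trained on eight GPUs if each one has to hold the full weights: at 16-bit precision the weights alone take about 140 GB, more than a typical 80 GB GPU holds. FSDP shards the weights so each GPU stores only its slice.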
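The arithmetic behind that example can be sketched in a few lines. This is a rough estimate, not a measurement: the function name `per_gpu_gib` and the byte counts are assumptions for illustration (2 bytes per parameter for 16-bit weights, 2 for gradients, and 12 for fp32 Adam state, a common mixed-precision accounting). It ignores activations and the temporary buffers FSDP needs when it gathers a layer's full weights.

```python
def per_gpu_gib(n_params, n_gpus, bytes_per_param, sharded):
    """Estimate memory per GPU in GiB for the given state size.

    bytes_per_param is an assumed figure: 2 for 16-bit weights only,
    16 for weights + gradients + fp32 Adam state in mixed precision.
    """
    total_bytes = n_params * bytes_per_param
    if sharded:
        # FSDP-style sharding: each GPU keeps 1/n_gpus of the state.
        total_bytes /= n_gpus
    return total_bytes / 2**30


N_PARAMS = 70_000_000_000  # the 70B model from the example
N_GPUS = 8

# Weights only, 16-bit: full copy per GPU versus one shard per GPU.
weights_full = per_gpu_gib(N_PARAMS, N_GPUS, bytes_per_param=2, sharded=False)
weights_shard = per_gpu_gib(N_PARAMS, N_GPUS, bytes_per_param=2, sharded=True)

# Weights + gradients + optimizer state (assumed 16 bytes per parameter).
train_full = per_gpu_gib(N_PARAMS, N_GPUS, bytes_per_param=16, sharded=False)
train_shard = per_gpu_gib(N_PARAMS, N_GPUS, bytes_per_param=16, sharded=True)

print(f"weights only:   {weights_full:7.1f} GiB full, {weights_shard:6.1f} GiB sharded")
print(f"training state: {train_full:7.1f} GiB full, {train_shard:6.1f} GiB sharded")
```

Under these assumptions the 16-bit weights drop from roughly 130 GiB per GPU to roughly 16 GiB once sharded eight ways. The full training state is eight times larger, so in practice a model this size is often spread over more than eight GPUs or combined with other memory-saving techniques.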

Heard on the show

“It uses a different memory-sharding approach, FSDP, which doesn't have this bug.”

Episode 009 — How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

Mentioned in 1 episode

009
How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

Related terms

gradient, parameter