Definition
A way to split a huge model across many GPUs so it fits in memory during training.
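Fully Sharded Data Parallel (FSDP), a distributed training scheme that shards parameters, gradients, and optimizer state across GPUs, so each GPU stores only its own slice and gathers full parameters only when a layer needs them.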
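
To make the memory claim concrete, here is a rough back-of-the-envelope sketch in Python. The byte counts are assumptions and do not come from the definition above: 16-bit weights and gradients plus 32-bit Adam state, about 16 bytes per parameter in total. The function name `per_gpu_bytes` and the 7-billion-parameter model are hypothetical, chosen only for illustration. Activations and communication buffers are ignored.

```python
# Illustrative byte counts per parameter (assumptions, not from the source):
# 16-bit weights and gradients, plus fp32 Adam state
# (master copy of the weights, momentum, variance).
BYTES_PARAM = 2
BYTES_GRAD = 2
BYTES_OPTIM = 12


def per_gpu_bytes(num_params: int, num_gpus: int, sharded: bool) -> int:
    """Approximate bytes of model state each GPU holds (activations ignored)."""
    total = num_params * (BYTES_PARAM + BYTES_GRAD + BYTES_OPTIM)
    if not sharded:
        # Plain data parallelism: every GPU keeps a full replica.
        return total
    # Full sharding: each GPU keeps roughly 1/num_gpus of every state tensor.
    # Ceiling division accounts for padding when the split is uneven.
    return -(-total // num_gpus)


if __name__ == "__main__":
    n = 7_000_000_000  # hypothetical 7B-parameter model
    for gpus in (1, 8, 64):
        replicated = per_gpu_bytes(n, gpus, sharded=False) / 1e9
        sharded = per_gpu_bytes(n, gpus, sharded=True) / 1e9
        print(f"{gpus:>3} GPUs: replicated {replicated:.1f} GB, sharded {sharded:.1f} GB")
```

Under these assumptions the replicated footprint stays constant however many GPUs are added, while the sharded footprint shrinks roughly in proportion to the GPU count. Real usage will be somewhat higher, since the full parameters of whichever layer is currently being computed are held temporarily on every GPU.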