Glossary · Term

teacher forcing

Definition

Plain language

During training, feeding the model the correct earlier tokens instead of letting it use its own predictions.

As stated in the literature

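A training strategy in which the model conditions on ground-truth previous tokens rather than its own outputs. Because the inputs at every step are known in advance, architectures such as Transformers can train all positions in parallel, but the mismatch with inference introduces exposure bias.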

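Why it matters: It makes training fast and parallelizable, but the model never practices recovering from its own mistakes. At inference it must condition on its own outputs, so an early error can compound over later steps. This gap between training and inference conditions is known as exposure bias.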

For example, when training a translation model, the decoder sees the correct prior words of the target sentence at each step rather than its own (possibly wrong) earlier guesses.
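
The contrast can be sketched with a toy example. This is a minimal illustration, not a real training loop: the "model" is a hypothetical lookup table (`TOY_MODEL`) that predicts the next token from the previous one, and every name in it is an assumption made for this sketch. It shows how the decoder inputs differ between teacher forcing and free-running decoding, and how a single wrong prediction stays isolated in the first case but propagates in the second.

```python
# Toy illustration only: a hypothetical "model" that predicts the next
# token from the previous token alone, using a fixed lookup table.
BOS = "<s>"

# Hypothetical, deliberately imperfect next-token table.
TOY_MODEL = {
    "<s>": "the",
    "the": "dog",   # mistake: the target sentence continues with "cat"
    "cat": "sat",
    "dog": "ran",
    "sat": "down",
    "ran": "away",
}


def predict_next(prev_token):
    """Return the toy model's prediction given the previous token."""
    return TOY_MODEL.get(prev_token, "<unk>")


def teacher_forced_inputs(target):
    """Decoder inputs under teacher forcing: the target shifted right by one."""
    return [BOS] + target[:-1]


def teacher_forced_predictions(target):
    """Each step conditions on the ground-truth previous token.

    The inputs are all known up front, so the steps do not depend on one
    another and could be computed in parallel.
    """
    return [predict_next(tok) for tok in teacher_forced_inputs(target)]


def free_running_predictions(length):
    """Each step conditions on the model's own previous prediction."""
    preds = []
    prev = BOS
    for _ in range(length):
        prev = predict_next(prev)
        preds.append(prev)
    return preds


target = ["the", "cat", "sat", "down"]
tf = teacher_forced_predictions(target)
fr = free_running_predictions(len(target))

print("target:        ", target)
print("teacher-forced:", tf)   # ['the', 'dog', 'sat', 'down']
print("free-running:  ", fr)   # ['the', 'dog', 'ran', 'away']
```

With teacher forcing, the mistake at the second position does not affect later steps, because the next input is the correct word "cat". In free-running mode the same mistake is fed back in, and every later prediction drifts away from the target. That drift is the behavior exposure bias describes.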

Mentioned in 1 episode

032
A Sticky-Note for Every Layer: Letting Transformers Remember What They Were Just Thinking

Related terms

ground truth token