PrefixLM · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A model setup where the whole question can be read at once, with every part visible to every other part, but the answer is still generated one word at a time.

As stated in the literature

A language modeling attention pattern using bidirectional attention over the prompt/prefix tokens and causal attention over response tokens, enabling encoder-like prompt comprehension within a decoder-only architecture.

Why it matters: It combines the comprehension strengths of encoders with the generation flexibility of decoders inside a single model.

For example, the model attends bidirectionally to the user's question but generates the answer token by token in the usual causal way.
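The attention pattern described above can be made concrete as a boolean mask. What follows is a minimal sketch in plain Python, not the implementation from any particular paper or library: the function names `prefix_lm_mask` and `render`, and the convention that `True` means "this query position may attend to this key position", are assumptions chosen for illustration.

```python
def prefix_lm_mask(seq_len: int, prefix_len: int) -> list[list[bool]]:
    """Build an illustrative PrefixLM attention mask.

    mask[q][k] is True when query position q may attend to key position k.
    Positions 0 .. prefix_len-1 are the prompt (prefix); the rest are the response.
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    return [
        [
            # Every position may see the whole prefix (bidirectional over the prompt),
            # and any position may see itself and earlier positions (causal).
            k < prefix_len or k <= q
            for k in range(seq_len)
        ]
        for q in range(seq_len)
    ]


def render(mask: list[list[bool]]) -> str:
    """Draw the mask as text: 'x' = may attend, '.' = blocked."""
    return "\n".join("".join("x" if allowed else "." for allowed in row) for row in mask)
```

With a 3-token prompt and a 2-token response, `print(render(prefix_lm_mask(5, 3)))` gives three rows of `xxx..` (each prompt token sees the full prompt but none of the response), then `xxxx.` and `xxxxx` (each response token sees the prompt plus the response generated so far). Setting `prefix_len` to 0 reduces the mask to ordinary causal attention.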

Heard on the show

“This is the PrefixLM piece.”

Episode 074 — How a Fifteen-Hundred-Dollar Training Run Matched Llama and Gemma on Reasoning

Mentioned in 1 episode

074
How a Fifteen-Hundred-Dollar Training Run Matched Llama and Gemma on Reasoning

Related terms

attention, causal attention, token