Definition
A model setup where the question can be read freely as a whole, but the answer still gets generated one word at a time.
A language modeling attention pattern using bidirectional attention over the prompt/prefix tokens and causal attention over response tokens, enabling encoder-like prompt comprehension within a decoder-only architecture.