Definition
A speed-up trick in which a small, fast draft model guesses the next several tokens and a big model checks all of the guesses in one parallel pass, keeping the ones it agrees with.
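An inference acceleration technique in which a smaller draft model proposes a sequence of candidate tokens that a larger target model verifies in a single forward pass, accepting the longest prefix of candidates that passes verification and resuming generation from there.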
Also called: speculative decoder, predicted outputs
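
Example
The draft-then-verify loop in the definition can be sketched with toy models. This is a minimal illustration of the greedy variant only, in which a drafted token is accepted when it equals the target model's own next token. Variants that sample tokens use a probabilistic accept/reject rule, which is not shown here. The names `toy_draft`, `toy_target`, `speculative_step`, and `generate` are hypothetical and exist only for this sketch. A real system would obtain all of the target model's predictions from one forward pass rather than from the loop used here for readability.

```python
from typing import Callable, List

# Hypothetical model type: maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def toy_target(tokens: List[int]) -> int:
    """Stand-in for the large model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    """Stand-in for the small model: agrees with the target except after a 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """Run one draft-and-verify round and return the newly accepted tokens."""
    # 1. The draft model proposes k tokens, one after another.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2. The target model checks every drafted position. In practice these
    #    predictions come from a single forward pass over prefix + proposal.
    accepted: List[int] = []
    for i in range(k):
        expected = target(prefix + proposal[:i])
        if proposal[i] != expected:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(proposal[i])

    # 3. Every draft token matched, so the target's next token comes for free.
    accepted.append(target(prefix + proposal))
    return accepted


def generate(prefix: List[int], draft: Model, target: Model, k: int, max_new: int) -> List[int]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_target, k=3, max_new=8))
```

In this greedy sketch the output is identical to what the target model would produce on its own. The draft model affects only how many tokens each round yields: up to k + 1 when every guess is accepted, and one when the first guess is rejected.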