token · Glossary · AI Papers: A Deep Dive

Definition

Plain language

The basic unit of text a language model reads or writes — roughly a word or part of a word.

As stated in the literature

A discrete unit from a tokenizer's vocabulary, often a subword piece, that language models consume and produce one at a time.

Also called: tokens

Why it matters: Context length limits and usage-based billing are typically counted in tokens, and the way text is tokenized affects how well models handle numbers, code, and non-English languages.

For example, the word 'unbelievable' might be split into the tokens 'un', 'believ', and 'able' before the model sees it.
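
The split above can be reproduced with a minimal sketch. It assumes a tiny hand-written vocabulary (TOY_VOCAB) and a simple greedy longest-match rule. Both are illustrative assumptions: real tokenizers learn vocabularies of many thousands of pieces from data, and their splitting rules differ in detail.

```python
def tokenize(word, vocab):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Nothing in the vocabulary matches here:
            # fall back to emitting a single character.
            tokens.append(word[i])
            i += 1
    return tokens


# Hypothetical toy vocabulary, chosen only to match the example in the text.
TOY_VOCAB = {"un", "believ", "able"}

print(tokenize("unbelievable", TOY_VOCAB))  # ['un', 'believ', 'able']
```

With this toy vocabulary, a word that shares no pieces with it falls back to single characters, which hints at why text unlike a tokenizer's training data tends to cost more tokens.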

“The word is built to discard variation so we can all agree on one token.”

tokenizer