Definition
One of many small specialists inside a transformer that decides which earlier tokens to focus on.
One of multiple parallel sub-units in a transformer attention layer, each computing its own query-key-value projection over earlier tokens.
Also called: attention heads, heads