HellaSwag · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A multiple-choice common-sense reasoning benchmark widely used to evaluate small language models.

As stated in the literature

A common-sense natural language inference benchmark where models pick the most plausible continuation of a short scenario; a standard zero-shot evaluation for pretrained models.

Why it matters: It's a quick, cheap zero-shot probe for whether a pretrained model has picked up everyday common-sense patterns from its training data.

For example, a HellaSwag question gives a setup like 'A woman is sitting at a piano. She…' and asks the model to pick the most plausible next sentence from four options.
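
To make the "pick the most plausible next sentence" step concrete, here is a minimal sketch of how a zero-shot multiple-choice item might be scored. It assumes the common approach of ranking each candidate ending by its log-likelihood under the model, optionally averaged per token. The glossary entry does not specify a scoring rule, so treat that as an assumption. The names `pick_ending` and `toy_scorer`, the four endings, and the gold label are all invented for illustration. A real evaluation would replace `toy_scorer` with calls to a language model.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: returns one log-probability per token of `ending`,
# conditioned on `context`. A real evaluation would query a language model here.
Scorer = Callable[[str, str], List[float]]


def pick_ending(context: str, endings: Sequence[str], scorer: Scorer,
                length_normalize: bool = True) -> int:
    """Return the index of the ending the scorer rates as most plausible."""
    scores = []
    for ending in endings:
        logprobs = scorer(context, ending)
        total = sum(logprobs)
        # Assumption: averaging per token keeps longer endings from being
        # penalized simply for containing more tokens.
        if length_normalize:
            scores.append(total / max(len(logprobs), 1))
        else:
            scores.append(total)
    return max(range(len(endings)), key=scores.__getitem__)


def toy_scorer(context: str, ending: str) -> List[float]:
    """Stand-in for a model: words that fit a piano scene get a higher log-prob."""
    likely = {"plays", "presses", "keys", "song", "music", "the", "a"}
    return [-1.0 if word.lower().strip(".,") in likely else -5.0
            for word in ending.split()]


context = "A woman is sitting at a piano. She"
# Invented endings; the real dataset supplies its own four options per item.
endings = [
    "eats the sheet music.",
    "plays a song.",
    "climbs a mountain.",
    "paints the fence blue.",
]
gold = 1  # invented gold label for this illustration

best = pick_ending(context, endings, toy_scorer)
accuracy = float(best == gold)  # over a full dataset, average this per item
```

Because nothing is fine-tuned and no worked examples are placed in the prompt, this counts as a zero-shot probe: the reported score is simply the fraction of items where the top-ranked ending matches the labeled one.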

Heard on the show

“And the absolute scores we're talking about — HellaSwag in the low forties, LAMBADA around forty — they're in a regime where the gaps between architectures are small and benchmark noise is real.”

Episode 033 — Echo: The Paper Arguing You Never Needed a KV Cache for Retrieval

Mentioned in 1 episode

033 · Echo: The Paper Arguing You Never Needed a KV Cache for Retrieval

Related terms

inference, pretraining, zero-shot