Definition
A multiple-choice common-sense reasoning benchmark widely used to evaluate small language models.
A common-sense natural language inference benchmark in which a model picks the most plausible continuation of a short scenario from a fixed set of candidates; commonly used as a zero-shot evaluation for pretrained models.
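
The "pick the most plausible continuation" setup can be sketched as scoring each candidate ending and taking the highest score. This is a minimal illustration only, not the benchmark's official harness. The names `Scorer`, `pick_ending`, `accuracy`, and `toy_score` are hypothetical, the example item is invented for the sketch, and the toy scorer stands in for a real model's log-likelihood of the ending given the context.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer type: returns a plausibility score for `ending` given
# `context`. In a zero-shot evaluation this would typically be a language
# model's log-likelihood of the ending; that model is assumed, not shown.
Scorer = Callable[[str, str], float]

# One item: (context, candidate endings, index of the labelled ending).
Example = Tuple[str, Sequence[str], int]


def pick_ending(context: str, endings: Sequence[str], score: Scorer) -> int:
    """Return the index of the highest-scoring candidate ending."""
    scores = [score(context, ending) for ending in endings]
    return max(range(len(endings)), key=scores.__getitem__)


def accuracy(examples: List[Example], score: Scorer) -> float:
    """Fraction of items where the top-scoring ending is the labelled one."""
    correct = sum(
        pick_ending(context, endings, score) == gold
        for context, endings, gold in examples
    )
    return correct / len(examples)


def toy_score(context: str, ending: str) -> float:
    # Toy stand-in for a model: counts ending words that also occur in the
    # context. It only exists so the sketch runs without any model.
    context_words = set(context.lower().split())
    return float(sum(word in context_words for word in ending.lower().split()))


# Invented item for illustration; not taken from any dataset.
examples: List[Example] = [
    (
        "She cracks two eggs into the pan.",
        ["She flies the pan to the moon.", "She stirs the eggs in the pan."],
        1,
    ),
]

print(accuracy(examples, toy_score))
```

Replacing `toy_score` with a function that queries a pretrained model, with no task-specific fine-tuning, gives the zero-shot variant of this loop. Evaluation harnesses often also normalise scores by ending length, which this sketch omits.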