Definition
A benchmark suite for testing how reliably a model can find specific information buried in long inputs.
A long-context evaluation suite covering needle-in-haystack, multi-key, multi-value, and aggregation tasks under varying context lengths.
Mentioned in 2 episodes
036
033