Definition
A benchmark suite for testing how well models handle long documents.
A multi-task evaluation suite for long-context understanding spanning summarization, retrieval, and reasoning over extended inputs.