Definition
A benchmark that buries reasoning tasks inside very long natural-language passages.
A long-context evaluation suite extending the bAbI tasks with substantial irrelevant text to stress-test retrieval and reasoning over realistic documents.
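
As a rough illustration of the idea in these definitions, the sketch below scatters a few bAbI-style facts among unrelated filler sentences while keeping the facts in their original order. It is a toy example under stated assumptions, not the benchmark's actual construction procedure: the function name bury_facts, the sample facts, and the filler sentences are all hypothetical.

```python
import random

def bury_facts(facts, filler, seed=0):
    """Scatter task facts among filler sentences (hypothetical helper).

    The relative order of the facts and of the filler is preserved;
    only the positions of the facts within the merged passage are random.
    """
    rng = random.Random(seed)
    total = len(facts) + len(filler)
    # Pick which positions of the merged sequence will hold a fact.
    fact_slots = set(rng.sample(range(total), len(facts)))
    facts_iter, filler_iter = iter(facts), iter(filler)
    return [
        next(facts_iter) if i in fact_slots else next(filler_iter)
        for i in range(total)
    ]

# Illustrative data only: two supporting facts and eight distractor sentences.
facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
filler = [f"Unrelated sentence {i}." for i in range(8)]

passage = " ".join(bury_facts(facts, filler))
question = "Where is the apple?"
```

Lengthening the filler list makes the passage longer without changing the answer, which is how such a setup can probe retrieval and reasoning separately from the difficulty of the underlying task.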