BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

TL;DR
BABILong is a comprehensive benchmark designed to evaluate large language models' reasoning abilities across extremely long contexts, revealing current models' limitations and highlighting the effectiveness of certain memory-augmented approaches.
Contribution
This work introduces the BABILong benchmark for assessing LLMs on long-context reasoning tasks and provides extensive evaluation results highlighting current models' performance gaps.
Findings
Popular LLMs utilize only 10-20% of context in reasoning tasks.
Performance drops sharply as reasoning complexity increases.
Fine-tuned recurrent memory transformers can process inputs of up to 50 million tokens.
Abstract
In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning,…
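The "facts scattered across long natural text" setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' generation code: the function name `build_haystack_sample`, its parameters, and the example facts and filler sentences are all hypothetical, chosen only to show the idea of hiding a few task-relevant facts, in their original order, inside an arbitrarily long stream of unrelated sentences.

```python
import random


def build_haystack_sample(facts, question, filler_sentences, n_filler, seed=0):
    """Hypothetical helper: hide task facts inside unrelated background text.

    The facts keep their relative order, because tasks such as fact
    chaining depend on the sequence of events.
    """
    rng = random.Random(seed)
    # Draw background sentences with replacement so any length can be reached.
    filler = [rng.choice(filler_sentences) for _ in range(n_filler)]
    # One insertion slot per fact, sorted so the fact order is preserved.
    slots = sorted(rng.randint(0, n_filler) for _ in facts)
    sentences = list(filler)
    # Insert from the last fact backwards so earlier slot indices stay valid.
    for slot, fact in reversed(list(zip(slots, facts))):
        sentences.insert(slot, fact)
    return {
        "sentences": sentences,
        "context": " ".join(sentences),
        "question": question,
    }


# Illustrative inputs (assumed for this sketch, not taken from the benchmark).
facts = [
    "Mary went to the kitchen.",
    "Mary picked up the apple.",
    "Mary travelled to the garden.",
]
filler = [
    "The rain had not stopped for three days.",
    "He closed the ledger and sighed.",
    "A cart rattled down the empty street.",
]
sample = build_haystack_sample(facts, "Where is the apple?", filler, n_filler=50)
print(len(sample["sentences"]), "sentences in context")
```

Raising `n_filler` lengthens the context without changing the reasoning task itself, which is the property that lets a benchmark of this kind probe how much of a long input a model actually uses.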
Taxonomy
Topics: Natural Language Processing Techniques · Artificial Intelligence in Law
Methods: Sparse Evolutionary Training
