BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Yuri Kuratov; Aydar Bulatov; Petr Anokhin; Ivan Rodkin; Dmitry Sorokin; Artyom Sorokin; Mikhail Burtsev

arXiv:2406.10149 · cs.CL · November 7, 2024 · 6 citations

TL;DR

BABILong is a comprehensive benchmark designed to evaluate large language models' reasoning abilities across extremely long contexts, revealing current models' limitations and highlighting the effectiveness of certain memory-augmented approaches.

Contribution

This work introduces the BABILong benchmark for assessing LLMs on long-context reasoning tasks and provides extensive evaluation results highlighting current models' performance gaps.

Findings

01 Popular LLMs effectively use only 10-20% of their context in reasoning tasks.

02 Performance drops sharply as reasoning complexity increases.

03 Fine-tuned recurrent memory transformers can process up to 50 million tokens.

Abstract

In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning,…
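
To make the "facts scattered across long natural text" idea concrete, here is a minimal sketch of how such a sample could be assembled. It is an illustration only, not the authors' generation code: the function name `build_haystack_sample`, its parameters, and the example sentences are all hypothetical, and the real benchmark may choose positions and background text differently.

```python
import random


def build_haystack_sample(facts, filler, n_filler, seed=0):
    """Scatter `facts` (relative order preserved) among distractor sentences.

    Hypothetical helper for illustration; not part of the BABILong code base.
    """
    rng = random.Random(seed)
    # Draw distractor sentences (with replacement) from the filler pool.
    background = [rng.choice(filler) for _ in range(n_filler)]
    # Sorted insertion slots keep the facts in their original order,
    # which matters for tasks such as fact chaining.
    slots = sorted(rng.randint(0, n_filler) for _ in facts)
    placements = iter(zip(slots, facts))
    pending = next(placements, None)

    sentences = []
    for i in range(n_filler + 1):
        # Emit every fact assigned to slot i before background sentence i.
        while pending is not None and pending[0] == i:
            sentences.append(pending[1])
            pending = next(placements, None)
        if i < n_filler:
            sentences.append(background[i])
    return " ".join(sentences)


# Illustrative inputs (assumed, not taken from the benchmark data).
facts = [
    "Mary went to the kitchen.",
    "Mary picked up the apple.",
    "Mary travelled to the garden.",
]
filler = [
    "The weather had been mild all week.",
    "A train passed slowly in the distance.",
    "Nobody remembered who had painted the fence.",
]
question = "Where is the apple?"
context = build_haystack_sample(facts, filler, n_filler=200, seed=1)
prompt = f"{context}\n\nQuestion: {question}"
```

Increasing `n_filler` lengthens the context without changing the reasoning required, which is the property that lets a benchmark of this kind separate context length from task difficulty.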

Code & Models

Models: msj19/opencompass

Videos: BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack (SlidesLive)

Taxonomy

Topics: Natural Language Processing Techniques · Artificial Intelligence in Law

Methods: Sparse Evolutionary Training