$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens

Xinrong Zhang; Yingfa Chen; Shengding Hu; Zihang Xu; Junhao Chen; Moo Khai Hao; Xu Han; Zhen Leng Thai; Shuo Wang; Zhiyuan Liu; Maosong Sun

arXiv:2402.13718·cs.CL·February 27, 2024·2 cites

TL;DR

This paper introduces $\infty$Bench, a benchmark whose average data length exceeds 100K tokens. It evaluates how well LLMs process extremely long contexts, exposes the limitations of current models, and points to directions for improvement.

Contribution

It presents the first long-context benchmark with an average data length exceeding 100K tokens. The benchmark covers diverse synthetic and realistic tasks in English and Chinese, and is used to evaluate and compare how well LLMs understand long-range dependencies.

Findings

01 Existing long-context LLMs need significant improvements.

02 Current models struggle with 100K+ token contexts.

03 Analysis reveals specific behaviors in LLMs processing long contexts.

Abstract

Processing and reasoning over long contexts is crucial for many practical applications of Large Language Models (LLMs), such as document comprehension and agent construction. Despite recent strides in making LLMs process contexts with more than 100K tokens, there is currently no standardized benchmark to evaluate this long-context capability. Existing public benchmarks typically focus on contexts around 10K tokens, limiting the assessment and comparison of LLMs in processing longer contexts. In this paper, we propose $\infty$Bench, the first LLM benchmark featuring an average data length surpassing 100K tokens. $\infty$Bench comprises synthetic and realistic tasks spanning diverse domains, presented in both English and Chinese. The tasks in $\infty$Bench are designed to require a thorough understanding of long dependencies in contexts, and make simply retrieving a limited number of…

Code & Models

Videos

Llama 405b: Full 92 page Analysis, and Uncontaminated SIMPLE Benchmark Results· youtube

$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens· underline

Taxonomy

Topics: Video Analysis and Summarization · Algorithms and Data Compression · Image Retrieval and Classification Techniques
