$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens
Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, Maosong Sun

TL;DR
This paper introduces $\infty$Bench, a new benchmark whose average data length exceeds 100K tokens, to evaluate LLMs' ability to process extremely long contexts, revealing current models' limitations and guiding future improvements.
Contribution
It presents the first long-context benchmark exceeding 100K tokens, including diverse tasks in English and Chinese, to evaluate and compare LLMs' long dependency understanding.
Findings
Existing long-context LLMs need significant improvements.
Current models struggle with 100K+ token contexts.
Analysis reveals specific behaviors in LLMs processing long contexts.
Abstract
Processing and reasoning over long contexts is crucial for many practical applications of Large Language Models (LLMs), such as document comprehension and agent construction. Despite recent strides in making LLMs process contexts with more than 100K tokens, there is currently a lack of a standardized benchmark to evaluate this long-context capability. Existing public benchmarks typically focus on contexts around 10K tokens, limiting the assessment and comparison of LLMs in processing longer contexts. In this paper, we propose $\infty$Bench, the first LLM benchmark featuring an average data length surpassing 100K tokens. $\infty$Bench comprises synthetic and realistic tasks spanning diverse domains, presented in both English and Chinese. The tasks in $\infty$Bench are designed to require well understanding of long dependencies in contexts, and make simply retrieving a limited number of…
Code & Models
- 🤗 namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA (model) · ♡ 24
- 🤗 namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged (model) · 12 dl · ♡ 12
- 🤗 namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged-GGUF (model) · 25 dl · ♡ 4
- 🤗 dhruvabansal/Llama-3-8B-Instruct-80K-QLoRA (model)
- 🤗 aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ (model) · 2 dl · ♡ 1
- 🤗 QuantFactory/Llama-3-8B-ProLong-64k-Instruct-GGUF (model) · 268 dl · ♡ 1
- 🤗 RichardErkhov/princeton-nlp_-_Llama-3-8B-ProLong-64k-Base-gguf (model) · 286 dl
- 🤗 RichardErkhov/namespace-Pt_-_Llama-3-8B-Instruct-80K-QLoRA-Merged-gguf (model) · 18 dl
Taxonomy
Topics: Video Analysis and Summarization · Algorithms and Data Compression · Image Retrieval and Classification Techniques
