Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks
Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen

TL;DR
Ada-LEval is a length-adaptable benchmark for evaluating how well large language models understand long documents. Its test cases scale up to ultra-long contexts of 128k tokens, and the results expose the limitations of current models in that range.
Contribution
The paper introduces Ada-LEval, a length-adaptable benchmark with new subsets for assessing long-context understanding in LLMs, covering ultra-long settings up to 128k tokens.
Findings
Current LLMs show limitations in ultra-long-context understanding.
Ada-LEval supports manipulation of test case lengths up to 128k tokens.
Evaluation of 10 models highlights the need for improved long-text capabilities.
Abstract
Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets based on open-source datasets, focusing mainly on QA and summarization tasks. These datasets include test samples of varying lengths (from 2k to 32k+ tokens) entangled together, making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultra-long settings (100k+ tokens) that the latest LLMs claim to achieve. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs.…
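The abstract's point about entangled sample lengths can be illustrated with a minimal sketch that groups per-sample results into length buckets and reports accuracy per bucket. This is not Ada-LEval's evaluation code: the bucket boundaries, the function names `bucket_for` and `accuracy_by_length`, and the toy results are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical length buckets in tokens (inclusive upper bounds); not taken from the paper.
BUCKETS = [2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000]


def bucket_for(num_tokens):
    """Return the smallest bucket bound that fits the sample, or None if it is too long."""
    for bound in BUCKETS:
        if num_tokens <= bound:
            return bound
    return None


def accuracy_by_length(results):
    """Compute accuracy per length bucket.

    results: iterable of (num_tokens, is_correct) pairs.
    Returns {bucket_bound: accuracy} for buckets containing at least one sample.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for num_tokens, is_correct in results:
        bound = bucket_for(num_tokens)
        if bound is None:
            continue  # longer than the largest bucket; skipped in this sketch
        total[bound] += 1
        correct[bound] += int(is_correct)
    return {bound: correct[bound] / total[bound] for bound in sorted(total)}


if __name__ == "__main__":
    # Made-up toy results for illustration only; not data from the paper.
    toy = [(1_500, True), (1_800, False), (7_000, True), (100_000, False)]
    for bound, acc in accuracy_by_length(toy).items():
        print(f"<= {bound} tokens: {acc:.2f}")
```

Reporting one number per bucket, rather than a single pooled score, is what makes a drop in accuracy at longer lengths visible. A length-adaptable benchmark goes further by controlling the length of each test case directly instead of sorting existing samples after the fact.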
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
