Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

Chonghua Wang; Haodong Duan; Songyang Zhang; Dahua Lin; Kai Chen

arXiv:2404.06480 · cs.CL · April 11, 2024 · 1 citation

TL;DR

Ada-LEval is a length-adaptable benchmark for evaluating how well large language models understand extremely long documents, including ultra-long contexts of up to 128k tokens. Its results expose the limitations of current models at those lengths.

Contribution

The paper introduces Ada-LEval, a length-adaptable benchmark with new subsets for assessing long-context understanding in LLMs, covering ultra-long settings up to 128k tokens.

Findings

01. Current LLMs show limitations in ultra-long-context understanding.

02. Ada-LEval supports manipulation of test case lengths up to 128k tokens.

03. Evaluation of 10 models highlights the need for improved long-text capabilities.

Abstract

Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets based on open-source datasets, focusing mainly on QA and summarization tasks. These datasets include test samples of varying lengths (from 2k to 32k+) entangled together, making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultralong settings (100k+ tokens) that the latest LLMs claim to achieve. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs.…

Code & Models

Repositories

open-compass/ada-leval (PyTorch, official)

Videos

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks (Underline)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques