LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K
Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, Yu Wang

TL;DR
LV-Eval is a comprehensive long-context benchmark with five levels up to 256k words, designed to evaluate large language models across different context lengths with improved accuracy and reduced bias.
Contribution
This paper introduces LV-Eval, a novel long-context benchmark with multiple length levels and techniques to mitigate bias, enabling more objective evaluation of LLMs at unprecedented context sizes.
Findings
Recent LLMs perform best below 64k context length.
Model performance degrades as contexts grow longer and when confusing facts are inserted.
LV-Eval reduces bias and knowledge leakage in long-context evaluation.
Abstract
State-of-the-art large language models (LLMs) are now claiming remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation. This paper introduces LV-Eval, a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) reaching up to 256k words. LV-Eval features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual datasets. The design of LV-Eval has incorporated three key techniques, namely confusing facts insertion, keyword and phrase replacement, and keyword-recall-based metric design. The advantages of LV-Eval include controllable evaluation across different context lengths, challenging test instances with confusing facts, mitigated knowledge…
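The keyword-recall-based metric named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: the function names (`keyword_recall`, `keyword_gated_f1`), the 0.5 threshold, and the choice to gate a standard token-overlap F1 on keyword recall are hypothetical and are not the authors' implementation. The tokenizer only handles whitespace-delimited text, so the Chinese half of a bilingual benchmark would need a proper word segmenter.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens (whitespace-delimited text only)."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Standard token-overlap F1 between a prediction and a reference answer."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def keyword_recall(prediction: str, keywords: list[str]) -> float:
    """Fraction of answer keywords whose tokens all appear in the prediction."""
    if not keywords:
        return 1.0
    pred_tokens = set(tokenize(prediction))
    hits = sum(
        1 for kw in keywords if all(t in pred_tokens for t in tokenize(kw))
    )
    return hits / len(keywords)


def keyword_gated_f1(
    prediction: str,
    reference: str,
    keywords: list[str],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> float:
    """Return token F1 only if enough answer keywords were recalled, else 0."""
    if keyword_recall(prediction, keywords) < threshold:
        return 0.0
    return token_f1(prediction, reference)


if __name__ == "__main__":
    reference = "The capital of France is Paris"
    wrong = "The capital of France is Lyon"
    # Plain F1 rewards the shared filler words; the gated score does not.
    print(token_f1(wrong, reference))
    print(keyword_gated_f1(wrong, reference, ["Paris"]))
```

The design point is that plain token F1 can give a high score to an answer that copies the question's wording but misses the key entity, whereas gating on keyword recall drives such answers to zero.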
Peer Reviews
Decision · Submitted to ICLR 2025
1. A new dataset that could potentially benefit the community 2. Paper presentation is overall clear 3. The proposed benchmark construction method is overall reasonable
The quality (and usefulness) of the proposed benchmark is not fully evaluated. How do state-of-the-art LLMs and humans (both domain experts and lay people) perform on this dataset? Does this benchmark capture an LLM's full capability (beyond the knowledge extraction/manipulation capabilities needed for the QA tasks)? In multiple places in the paper, the authors mention human interventions (e.g., Line 269, “ask human annotators to resolve any conflicts in generated facts”), more analysis/discussions on the hum
1. The dataset design approach, with 3 options for confusing the evaluated LLMs, allows a broader look at LLM capabilities in dealing with potentially out-of-distribution content. This could be a substantial addition to the set of benchmarks for thorough examination of LLMs in the context of their generalization. 2. The 5 different length levels make it possible to pinpoint how well a model is able (or unable) to recall information from different parts of the context. 3. The inclusion of bilingual data
1. The conducted evaluation of LLMs on the LV-Eval benchmark included only 3 closed-source models. It is even more confusing that among those 3 models 2 are very outdated versions of GPT-3.5 and GPT-4. This paper would immensely benefit from inclusion of at least relatively recent closed-source LLMs, such as GPT-4 with 128k context window (which was released almost a year ago to this date), along with Anthropic Claude, which shows remarkable performance in long-context recall. The argumentation
1. It's got a good range of lengths, which is key for seeing how models handle long texts. 2. It's not just English—it's got Chinese too, so it's more useful for different models. 3. They tried new things to make the test harder and stop models from cheating with common knowledge. 4. The scoring is more focused on the important bits of the answer, which makes it more accurate. 5. They shared all the data and code, which is cool for transparency and building on their work.
1. It's mostly about question-answering, which might not cover everything we need for understanding long texts. 2. Testing some models is pricey, so they couldn't check out all the new ones. 3. They're still relying a lot on people to check the tricky parts, which takes time and can be off. 4. There's a chance models could just learn the test, not actually get better at understanding. 5. Models had a hard time with the confusing stuff, so maybe the test needs more of that.
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Computing and Algorithms · Image Enhancement Techniques
