A Controllable Examination for Long-Context Language Models
Yijun Yang, Zeyu Huang, Wenhao Zhu, Zihan Qiu, Fei Yuan, Jeff Z. Pan, Ivan Titov

TL;DR
This paper introduces LongBioBench, a controllable and interpretable benchmark using artificially generated biographies to evaluate long-context language models across understanding, reasoning, and trustworthiness, addressing limitations of existing benchmarks.
Contribution
The study presents LongBioBench, a novel benchmark that offers a controllable, sound, and interpretable environment for evaluating long-context language models, improving upon existing synthetic benchmarks.
Findings
Most models show deficiencies in semantic understanding and reasoning.
Model trustworthiness decreases as context length increases.
Existing synthetic benchmarks are vulnerable because their target information is not coherent with the surrounding context and they lack distractors.
Abstract
Existing frameworks for evaluating long-context language models (LCLMs) can be broadly categorized into real-world applications (e.g., document summarization) and synthetic tasks (e.g., needle-in-a-haystack). Despite their utility, both approaches have intrinsic limitations. Real-world tasks often involve complexity that makes interpretation challenging, and they suffer from data contamination, whereas synthetic tasks frequently lack meaningful coherence between the target information (needle) and its surrounding context (haystack), undermining their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: 1) seamless context, 2) controllable setting, and 3) sound evaluation. This study introduces LongBioBench, a benchmark that…
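The three features named in the abstract can be illustrated with a toy construction. The following is a minimal sketch only, not the benchmark's actual generator: the attribute pools, the sentence template, and the helper names `make_bio`, `build_example`, and `score` are all hypothetical. It shows a target biography placed among distractor biographies of the same form (seamless context), a `num_bios` parameter that sets context length (controllable setting), and an exact-match check against a known gold answer (sound evaluation).

```python
import random

# Hypothetical attribute pools; the real benchmark's schema is not given in this summary.
NAMES = ["Ada Quill", "Boris Tan", "Carmen Ode", "Dev Harlow", "Elia Moss", "Farah Lin"]
CITIES = ["Lisbon", "Osaka", "Quito", "Tallinn"]
JOBS = ["botanist", "welder", "cartographer", "pianist"]


def make_bio(name, rng):
    """Return (attributes, text) for one synthetic biography."""
    attrs = {"city": rng.choice(CITIES), "job": rng.choice(JOBS)}
    text = f"{name} was born in {attrs['city']} and works as a {attrs['job']}."
    return attrs, text


def build_example(num_bios, seed=0):
    """Build a context of num_bios biographies plus one question and gold answer.

    The target biography (needle) has the same form as the distractors
    (haystack), so the context stays coherent; num_bios controls its length.
    num_bios must not exceed len(NAMES), because names are sampled without
    replacement.
    """
    rng = random.Random(seed)
    names = rng.sample(NAMES, num_bios)
    bios = {name: make_bio(name, rng) for name in names}
    target = rng.choice(names)
    context = " ".join(text for _, text in bios.values())
    question = f"In which city was {target} born?"
    answer = bios[target][0]["city"]
    return context, question, answer


def score(prediction, answer):
    """Exact-match check after trimming whitespace and ignoring case."""
    return prediction.strip().lower() == answer.lower()
```

Calling `build_example(4, seed=1)` returns a four-sentence context, a question about one of the four people, and the gold city. Because every sentence follows the same template, a model cannot locate the answer by spotting out-of-place text and must match the name in the question to the right biography.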
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
