A Controllable Examination for Long-Context Language Models

Yijun Yang; Zeyu Huang; Wenhao Zhu; Zihan Qiu; Fei Yuan; Jeff Z. Pan; Ivan Titov

arXiv:2506.02921 · cs.CL · October 21, 2025

TL;DR

This paper introduces LongBioBench, a controllable and interpretable benchmark using artificially generated biographies to evaluate long-context language models across understanding, reasoning, and trustworthiness, addressing limitations of existing benchmarks.

Contribution

The study presents LongBioBench, a novel benchmark that offers a controllable, sound, and interpretable environment for evaluating long-context language models, improving upon existing synthetic benchmarks.

Findings

1. Most models show deficiencies in semantic understanding and reasoning.
2. Model trustworthiness decreases as context length increases.
3. Existing benchmarks are vulnerable due to non-coherence and lack of distractors.

Abstract

Existing frameworks for evaluating long-context language models (LCLMs) can be broadly categorized into real-world applications (e.g., document summarization) and synthetic tasks (e.g., needle-in-a-haystack). Despite their utility, both approaches have intrinsic limitations. Real-world tasks often involve complexity that makes interpretation challenging and suffer from data contamination, whereas synthetic tasks frequently lack meaningful coherence between the target information (needle) and its surrounding context (haystack), undermining their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: 1) seamless context, 2) controllable setting, and 3) sound evaluation. This study introduces LongBioBench, a benchmark that…
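To make the idea of a biography-based haystack concrete, here is a minimal sketch in Python. It is not the LongBioBench generator: the attribute pools, the sentence template, and the function names `make_bio` and `build_example` are all hypothetical, chosen only to illustrate how a target biography can sit among distractor biographies of the same form, how context size can be controlled, and how an answer can be checked by exact match.

```python
import random

# Hypothetical attribute pools; the benchmark's actual schema is not given here.
NAMES = ["Ava Lindqvist", "Tomas Herrera", "Mei Okafor", "Jonas Petrov"]
CITIES = ["Lisbon", "Osaka", "Tartu", "Quito"]
JOBS = ["botanist", "architect", "cellist", "cartographer"]


def make_bio(name, city, job):
    """Render one fictional biography as a short paragraph."""
    return f"{name} was born in {city}. {name} works as a {job}."


def build_example(num_bios, seed=0):
    """Build a context of biographies plus one question about a single bio.

    num_bios controls context size and must not exceed len(NAMES).
    """
    rng = random.Random(seed)
    names = rng.sample(NAMES, num_bios)  # distinct names, so the answer is unambiguous
    records = [(n, rng.choice(CITIES), rng.choice(JOBS)) for n in names]
    target = rng.choice(records)  # the "needle"; the other bios act as distractors
    context = "\n\n".join(make_bio(*r) for r in records)
    question = f"In which city was {target[0]} born?"
    return context, question, target[1]


context, question, answer = build_example(num_bios=3, seed=1)
```

Because the needle and the distractors share one template, the target does not stand out stylistically from its surroundings, and raising `num_bios` lengthens the context without changing the task.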

Code & Models

Datasets

thomasyyj/LongBioBench_Sample
dataset · 24 downloads

Videos

A Controllable Examination for Long-Context Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques