SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning
Renxi Wang, Honglin Mu, Liqun Ma, Lizhi Lin, Yunlong Feng, Timothy Baldwin, Xudong Han, Haonan Li

TL;DR
SCALAR is a benchmark for evaluating large language models' ability to perform citation-grounded reasoning over long academic texts. It uses automatically generated labels and two tasks: multiple-choice QA and cloze-style citation prediction.
Contribution
It introduces a novel, automatically labeled benchmark with controllable difficulty and dynamic updates for assessing citation-based reasoning in LLMs.
Findings
Most state-of-the-art models struggle on the benchmark, while human experts achieve over 90% accuracy.
The multiple-choice task effectively differentiates model capabilities.
The cloze-style task remains highly challenging for current models, with no model exceeding 50%.
Abstract
Long-context understanding has emerged as a critical capability for large language models (LLMs). However, evaluating this ability remains challenging. We present SCALAR, a benchmark designed to assess citation-grounded long-context reasoning in academic writing. SCALAR leverages academic papers and their citation structure to automatically generate high-quality ground-truth labels without human annotation. It features controllable difficulty levels and a dynamic updating mechanism that mitigates data contamination. The benchmark includes two tasks: a multiple-choice QA format and a cloze-style citation prediction. We evaluate a range of state-of-the-art LLMs and find that the multiple-choice task effectively distinguishes model capabilities. While human experts achieve over 90% accuracy, most models struggle. The cloze-style task is even more challenging, with no model exceeding 50%…
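To make the multiple-choice format concrete, the sketch below shows one way a citation-masking item could be built and scored. It is an illustration under stated assumptions, not SCALAR's actual pipeline: the `[CITE:key]` marker format, the `MASK` token, the function names, and the use of the distractor count as a rough difficulty knob are all hypothetical.

```python
import random

# Hypothetical placeholder that replaces the citation to be predicted.
MASK = "[MASKED_CITATION]"


def make_multiple_choice_item(text, target_key, reference_titles,
                              n_distractors=3, seed=0):
    """Mask one citation and build a multiple-choice question from it.

    Assumptions (not from the paper): citations appear in `text` as
    "[CITE:<key>]" markers, and `reference_titles` maps each key to the
    title of the cited paper. The label comes for free from the citation
    itself, so no human annotation is needed.
    """
    marker = f"[CITE:{target_key}]"
    if marker not in text:
        raise ValueError(f"citation {target_key!r} not found in text")

    # Hide every occurrence of the target citation; other citations stay.
    masked_text = text.replace(marker, MASK)

    # Distractors are drawn from the other references of the same paper.
    other_keys = [k for k in reference_titles if k != target_key]
    if n_distractors > len(other_keys):
        raise ValueError("not enough references to sample distractors")

    rng = random.Random(seed)
    option_keys = rng.sample(other_keys, n_distractors) + [target_key]
    rng.shuffle(option_keys)

    return {
        "context": masked_text,
        "options": [reference_titles[k] for k in option_keys],
        "answer": option_keys.index(target_key),
    }


def accuracy(predictions, answers):
    """Fraction of items where the predicted option index is correct."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have equal length")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```

In this sketch, raising `n_distractors` gives more options to rule out, which is one simple way an item could be made harder; how SCALAR actually controls difficulty is not specified in the abstract above.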
Taxonomy
Topics: Online Learning and Analytics · Intelligent Tutoring Systems and Adaptive Learning
