SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning

Renxi Wang; Honglin Mu; Liqun Ma; Lizhi Lin; Yunlong Feng; Timothy Baldwin; Xudong Han; Haonan Li

arXiv:2502.13753·cs.CL·January 23, 2026

TL;DR

SCALAR is a benchmark for evaluating how well large language models perform citation-grounded reasoning over long academic texts. It uses automatically generated labels and two tasks: multiple-choice QA and cloze-style citation prediction.

Contribution

The paper introduces an automatically labeled benchmark, with controllable difficulty and dynamic updates, for assessing citation-based reasoning in LLMs.

Findings

01. State-of-the-art models fall well short of human experts, who achieve over 90% accuracy.

02. The multiple-choice task effectively differentiates model capabilities.

03. The cloze-style task remains highly challenging for current models.

Abstract

Long-context understanding has emerged as a critical capability for large language models (LLMs). However, evaluating this ability remains challenging. We present SCALAR, a benchmark designed to assess citation-grounded long-context reasoning in academic writing. SCALAR leverages academic papers and their citation structure to automatically generate high-quality ground-truth labels without human annotation. It features controllable difficulty levels and a dynamic updating mechanism that mitigates data contamination. The benchmark includes two tasks: a multiple-choice QA format and a cloze-style citation prediction. We evaluate a range of state-of-the-art LLMs and find that the multiple-choice task effectively distinguishes model capabilities. While human experts achieve over 90% accuracy, most models struggle. The cloze-style task is even more challenging, with no model exceeding 50%…
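To make the labeling idea concrete, here is a minimal Python sketch of how a citation in a passage could be turned into an automatically labeled item. It is an illustration under stated assumptions, not the authors' pipeline: the names `ClozeItem`, `make_item`, and `accuracy` are hypothetical, the bracketed-number citation format is assumed, and using the number of distractors as a difficulty knob is a guess rather than the mechanism the paper describes.

```python
import random
from dataclasses import dataclass


@dataclass
class ClozeItem:
    masked_text: str    # passage with one citation replaced by [MASK]
    options: list       # candidate reference titles, shuffled
    answer_index: int   # position of the truly cited reference in options


def make_item(text, references, target, n_distractors=3, seed=0):
    """Mask the citation `target` (e.g. "[2]") and build a multiple-choice item.

    `references` maps citation markers to reference titles. The label comes
    from the citation itself, so no human annotation is needed.
    All names here are hypothetical, chosen for illustration only.
    """
    if target not in references or target not in text:
        raise ValueError(f"{target} is not cited in the passage")
    rng = random.Random(seed)  # seeded for reproducible items
    masked = text.replace(target, "[MASK]", 1)
    # Distractors are drawn from the other references (assumed strategy).
    pool = [title for key, title in references.items() if key != target]
    distractors = rng.sample(pool, min(n_distractors, len(pool)))
    options = distractors + [references[target]]
    rng.shuffle(options)
    return ClozeItem(masked, options, options.index(references[target]))


def accuracy(predictions, items):
    """Fraction of items whose predicted option index matches the answer."""
    if not items:
        return 0.0
    correct = sum(p == item.answer_index for p, item in zip(predictions, items))
    return correct / len(items)
```

A caller would build one item per masked citation, collect a model's chosen option index for each, and pass both lists to `accuracy`. The sketch assumes reference titles are distinct, since the answer is located by title.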

Code & Models

Repositories

librairesearch/scalar (Official)

Videos

SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning· underline

Taxonomy

Topics: Online Learning and Analytics · Intelligent Tutoring Systems and Adaptive Learning