TL;DR
ScholScan introduces a comprehensive benchmark for evaluating multimodal large language models on scan-oriented academic paper reasoning, emphasizing full-document understanding and verification beyond relevance retrieval.
Contribution
This work presents ScholScan, a new benchmark with annotated questions, evidence localization, and reasoning traces to evaluate MLLMs on scan-oriented academic paper reasoning tasks.
Findings
MLLMs show systematic deficiencies on scan-oriented tasks.
Retrieval-augmented generation methods do not significantly improve performance.
ScholScan highlights the challenge of full-document understanding in current models.
Abstract
With the rapid progress of multimodal large language models (MLLMs), AI already performs well at literature retrieval and certain reasoning tasks, serving as a capable assistant to human researchers, yet it remains far from autonomous research. The fundamental reason is that current work on academic paper reasoning is largely confined to a search-oriented paradigm centered on pre-specified targets, with reasoning grounded in relevance retrieval, which struggles to support researcher-style full-document understanding, reasoning, and verification. To bridge this gap, we propose ScholScan, a new benchmark for academic paper reasoning. ScholScan introduces a scan-oriented task setting that asks models to read and cross-check entire papers like human researchers, scanning the document to identify consistency issues. The benchmark comprises 1,800 carefully annotated questions drawn…