DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding

Xiang Feng; Jiawei Zhou; Zhangfeng Huang; Kewei Wang; Shanshan Ye; Jinxin Hu; Zulong Chen; Yong Luo; Jing Zhang

arXiv:2605.08888 · cs.CL · May 15, 2026

TL;DR

DocScope is a comprehensive benchmark for evaluating trustworthy reasoning in long, multimodal documents, emphasizing structured evidence and fact verification over simple answer accuracy.

Contribution

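DocScope introduces a four-stage evaluation protocol (Page Localization, Region Grounding, Fact Extraction, Answer Verification) and a human-annotated dataset of 1,124 questions over 273 documents for assessing verifiable reasoning in long-document understanding models.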

Findings

01

Answer accuracy alone is insufficient; complete evidence chains are achieved only 29% of the time.

02

Region grounding is the weakest stage in the reasoning trajectory.

03

Larger models with more activated parameters tend to perform better.
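
The gap described in finding 01 can be illustrated with a small sketch. It assumes each question has already been given a pass/fail verdict at each of the four stages; the stage keys, the `results` records, and both function names are hypothetical and do not come from the DocScope code, and the toy data is invented for illustration only.

```python
# Hypothetical stage keys; the benchmark's own field names may differ.
STAGES = ("page", "region", "fact", "answer")

def answer_accuracy(results):
    """Fraction of questions whose final answer was judged correct."""
    return sum(r["answer"] for r in results) / len(results)

def complete_chain_rate(results):
    """Fraction of questions that pass every stage of the trajectory."""
    return sum(all(r[s] for s in STAGES) for r in results) / len(results)

# Invented toy verdicts: three of four answers are right,
# but only one question has a fully correct evidence chain.
results = [
    {"page": True,  "region": True,  "fact": True,  "answer": True},
    {"page": True,  "region": False, "fact": True,  "answer": True},
    {"page": False, "region": False, "fact": False, "answer": True},
    {"page": True,  "region": False, "fact": False, "answer": False},
]

print(answer_accuracy(results))      # 0.75
print(complete_chain_rate(results))  # 0.25
```

A model can therefore score well on answers while most of its answers lack a fully verifiable evidence chain, which is why the two numbers are reported separately.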

Abstract

Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end-to-end answer accuracy. We introduce DocScope, a benchmark that formulates long-document QA as a structured reasoning trajectory prediction problem: given a complete PDF document and a question, the model outputs evidence pages, supporting evidence regions, relevant factual statements, and a final answer. We design a four-stage evaluation protocol -- Page Localization, Region Grounding, Fact Extraction, and Answer Verification -- that audits each level of the trajectory independently through inter-stage decoupling, with all judges selected and calibrated via human alignment studies. DocScope comprises 1,124 questions derived from 273 documents, with all hierarchical evidence annotations completed by human annotators. We…
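
As a rough sketch of the trajectory format described in the abstract, the four outputs can be modeled as one record and each stage scored on its own. The class names, field names, bounding-box convention, and the choice of set-based F1 for page localization are all assumptions made for illustration; the text above does not specify the benchmark's schema or metrics.

```python
from dataclasses import dataclass

@dataclass
class Region:
    page: int
    # Assumed (x0, y0, x1, y1) box; the real coordinate convention is not given here.
    bbox: tuple[float, float, float, float]

@dataclass
class Trajectory:
    pages: list[int]       # evidence pages
    regions: list[Region]  # supporting evidence regions
    facts: list[str]       # relevant factual statements
    answer: str            # final answer

def page_f1(pred: Trajectory, gold: Trajectory) -> float:
    """Set-based F1 over evidence pages (an assumed stand-in metric)."""
    p, g = set(pred.pages), set(gold.pages)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(g)
    return 2 * precision * recall / (precision + recall)

gold = Trajectory(pages=[3, 7], regions=[Region(3, (0.1, 0.2, 0.5, 0.4))],
                  facts=["Revenue grew in 2021."], answer="2021")
pred = Trajectory(pages=[3, 4], regions=[], facts=[], answer="2021")

print(page_f1(pred, gold))  # 0.5
```

Scoring pages here without looking at `regions`, `facts`, or `answer` mirrors the idea of auditing each level independently: a correct final answer does not mask a wrong page set, and vice versa.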

Code & Models

Repositories

MiliLab/DocScope (GitHub)
