TL;DR
DocScope is a comprehensive benchmark for evaluating trustworthy reasoning in long, multimodal documents, emphasizing structured evidence and fact verification over simple answer accuracy.
Contribution
DocScope introduces a four-stage evaluation protocol and a human-annotated dataset of 1,124 questions over 273 documents for assessing verifiable reasoning in long-document understanding models.
Findings
Answer accuracy alone is insufficient: models produce a complete evidence chain only 29% of the time.
Region grounding is the weakest stage in the reasoning trajectory.
Larger models with more activated parameters tend to perform better.
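The gap between answer accuracy and complete evidence chains can be illustrated with a small sketch. It assumes each question receives one pass/fail verdict per stage and that a chain counts as complete only when all four stages pass. The class name, field names, and toy verdicts below are hypothetical and are not taken from the benchmark, whose actual scoring may differ.

```python
from dataclasses import dataclass


@dataclass
class StageVerdicts:
    """Hypothetical per-question pass/fail verdicts, one per stage."""
    page_ok: bool
    region_ok: bool
    fact_ok: bool
    answer_ok: bool

    def chain_complete(self) -> bool:
        # Assumption: a chain is complete only if every stage passes.
        return self.page_ok and self.region_ok and self.fact_ok and self.answer_ok


def answer_accuracy(results: list[StageVerdicts]) -> float:
    """Fraction of questions whose final answer was judged correct."""
    return sum(r.answer_ok for r in results) / len(results)


def chain_completeness(results: list[StageVerdicts]) -> float:
    """Fraction of questions where all four stages were judged correct."""
    return sum(r.chain_complete() for r in results) / len(results)


# Toy verdicts invented for illustration; not benchmark results.
toy = [
    StageVerdicts(True, True, True, True),
    StageVerdicts(True, False, True, True),    # right answer, wrong region
    StageVerdicts(True, False, False, True),   # right answer, weak evidence
    StageVerdicts(False, False, False, False),
]

print(answer_accuracy(toy), chain_completeness(toy))
```

On these toy verdicts three of four answers are correct but only one chain is complete, which is the kind of gap an answer-only metric would hide.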
Abstract
Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end-to-end answer accuracy. We introduce DocScope, a benchmark that formulates long-document QA as a structured reasoning trajectory prediction problem: given a complete PDF document and a question, the model outputs evidence pages, supporting evidence regions, relevant factual statements, and a final answer. We design a four-stage evaluation protocol -- Page Localization, Region Grounding, Fact Extraction, and Answer Verification -- that audits each level of the trajectory independently through inter-stage decoupling, with all judges selected and calibrated via human alignment studies. DocScope comprises 1,124 questions derived from 273 documents, with all hierarchical evidence annotations completed by human annotators. We…
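The structured trajectory described in the abstract can be pictured as a simple record with one field per level. The sketch below is a minimal illustration under stated assumptions: the type names, the bounding-box convention (x0, y0, x1, y1), and the two scoring helpers (set-based page F1 and box IoU) are hypothetical choices made here, not the benchmark's published schema or metrics. It also assumes one reading of inter-stage decoupling, namely that each stage is scored against its own gold annotations rather than against the output of the previous stage.

```python
from dataclasses import dataclass, field

# Assumed box convention: (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
Box = tuple[float, float, float, float]


@dataclass
class Region:
    page: int
    box: Box


@dataclass
class Trajectory:
    """Hypothetical container for the four outputs named in the abstract."""
    pages: list[int] = field(default_factory=list)       # evidence pages
    regions: list[Region] = field(default_factory=list)  # supporting regions
    facts: list[str] = field(default_factory=list)       # factual statements
    answer: str = ""                                      # final answer


def page_f1(pred_pages: list[int], gold_pages: list[int]) -> float:
    """Set-based F1 over page indices (one possible page-level score)."""
    pred, gold = set(pred_pages), set(gold_pages)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Toy prediction and gold record, invented for illustration.
pred = Trajectory(pages=[3, 4, 7], regions=[Region(4, (0, 0, 2, 2))])
gold = Trajectory(pages=[4, 7, 9], regions=[Region(4, (1, 1, 3, 3))])

print(page_f1(pred.pages, gold.pages))
print(box_iou(pred.regions[0].box, gold.regions[0].box))
```

Under this reading, the region score for page 4 is computed against the gold region even though the predicted page set also contains a wrong page, so an early localization error does not automatically zero out later stages.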
