CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

Dongsheng Ma; Jiayu Li; Zhengren Wang; Yijie Wang; Jiahao Kong; Weijun Zeng; Jutao Xiao; Jie Yang; Wentao Zhang; Bin Wang; Conghui He

arXiv:2605.12882·cs.CL·May 14, 2026

TL;DR

CiteVQA is a benchmark that evaluates multimodal language models on their ability to correctly cite evidence regions in documents, exposing a prevalent issue of models providing correct answers with incorrect citations.

Contribution

We introduce CiteVQA, a comprehensive benchmark with automated evidence annotation and strict accuracy metrics to assess evidence attribution in document understanding models.

Findings

01 Models often produce correct answers with incorrect evidence citations.

02 The best model achieves only 76.0% accuracy in correct answer and citation pairing.

03 Open-source models perform significantly worse, with the top open-source model reaching just 22.5%.

Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage. That is a serious risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer and evaluates both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial…
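To make the idea of joint answer-and-citation scoring concrete, here is a minimal sketch. It is not the benchmark's actual metric, which this excerpt does not specify. It assumes that a citation is a page number plus an axis-aligned box, that a predicted box matches a ground-truth box when their intersection-over-union (IoU) reaches a threshold of 0.5, and that every ground-truth region must be matched. The names `Citation`, `iou`, `citation_matches`, and `joint_correct` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Assumed box layout: (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
Box = tuple[float, float, float, float]


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record: a page index and a bounding box."""
    page: int
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def citation_matches(pred: Citation, gold: Citation, threshold: float = 0.5) -> bool:
    """A prediction matches when it is on the same page and overlaps enough.

    The 0.5 threshold is an assumption, not a value taken from the paper.
    """
    return pred.page == gold.page and iou(pred.box, gold.box) >= threshold


def joint_correct(
    answer_ok: bool,
    preds: list[Citation],
    golds: list[Citation],
    threshold: float = 0.5,
) -> bool:
    """Count a sample as correct only if the answer is right AND every
    ground-truth region is matched by at least one predicted citation."""
    if not answer_ok or not golds:
        return False
    return all(
        any(citation_matches(p, g, threshold) for p in preds) for g in golds
    )
```

Under this sketch, a model that answers correctly but cites a box on the wrong page scores zero for that question, which is the failure mode the abstract describes.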

Code & Models

Repositories

opendatalab/CiteVQA
github

Datasets

opendatalab/CiteVQA
dataset · 520 downloads