Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions
Kjetil Indrehus, Adrian Duric, Changkyu Choi, Ali Ramezani-Kebrya

TL;DR
This paper introduces CoExVQA, a self-explainable framework for Document Visual Question Answering that grounds each reasoning step in the document, improving both explainability and performance over existing explainable baselines.
Contribution
The paper presents a chain-of-explanation approach that first identifies question-relevant evidence, then localizes the answer region, and decodes the answer only from that region, making DocVQA predictions more transparent and more accurate.
Findings
Achieves state-of-the-art explainable DocVQA performance on PFL-DocVQA.
Improves ANLS (Average Normalized Levenshtein Similarity) by 12% over current explainable baselines.
Enables direct inspection and verification of the reasoning process.
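For readers unfamiliar with the metric behind the 12% figure, the following is a minimal sketch of ANLS as it is commonly defined for DocVQA benchmarks: per question, take the best normalized edit similarity against any reference answer, zeroing it when the normalized distance reaches a threshold. The function names, the lowercasing and whitespace stripping, and the threshold of 0.5 are assumptions based on the usual convention, not details taken from this paper. The summary also does not say whether the 12% is an absolute or a relative gain.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions: list of predicted answer strings.
    references:  list of lists of acceptable answers, one list per question.
    tau:         assumed threshold; normalized distances >= tau score 0.
    """
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            sim = 1.0 - nl if nl < tau else 0.0
            best = max(best, sim)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

The threshold is what separates ANLS from plain edit similarity: a near miss such as an OCR slip still earns partial credit, while an unrelated answer scores zero rather than a small positive value.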
Abstract
Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence and answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework with a grounded reasoning process through a chain-of-explanation design. CoExVQA first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. Prediction via CoExVQA's chain-of-explanation enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded…
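The three-stage flow described in the abstract (find evidence, localize the answer region, decode only from that region) can be illustrated with a toy sketch. This is not CoExVQA's architecture: the paper uses learned vision-language components, whereas every function below is a hypothetical hand-written heuristic over OCR-style tokens, chosen only to show how restricting decoding to a grounded region makes each intermediate step inspectable. The `Token` type, the function names, and the sample document are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    box: tuple  # (x0, y0, x1, y1) in page coordinates


def _norm(word: str) -> str:
    return word.strip("?.,:").lower()


def score_evidence(question, tokens):
    """Stage 1 (toy stand-in): score 1.0 for tokens whose text occurs in the question."""
    q_words = {_norm(w) for w in question.split()}
    return [1.0 if _norm(t.text) in q_words else 0.0 for t in tokens]


def localize_answer(tokens, evidence):
    """Stage 2 (toy stand-in): box around tokens to the right of the last evidence token, same row."""
    hits = [i for i, s in enumerate(evidence) if s > 0]
    if not hits:
        return None
    anchor = tokens[hits[-1]].box
    row = [t.box for t in tokens
           if t.box[1] < anchor[3] and t.box[3] > anchor[1]  # vertical overlap
           and t.box[0] >= anchor[2]]                        # starts right of anchor
    if not row:
        return None
    return (min(b[0] for b in row), min(b[1] for b in row),
            max(b[2] for b in row), max(b[3] for b in row))


def decode_from_region(tokens, region):
    """Stage 3: read out only the tokens that lie fully inside the grounded region."""
    if region is None:
        return ""
    x0, y0, x1, y1 = region
    inside = [t.text for t in tokens
              if t.box[0] >= x0 and t.box[1] >= y0 and t.box[2] <= x1 and t.box[3] <= y1]
    return " ".join(inside)


def answer_question(question, tokens):
    evidence = score_evidence(question, tokens)
    region = localize_answer(tokens, evidence)
    return {"evidence": evidence, "region": region,
            "answer": decode_from_region(tokens, region)}


# Hypothetical two-line document.
doc = [
    Token("Invoice", (0, 0, 50, 10)),
    Token("date:", (55, 0, 90, 10)),
    Token("2021-03-04", (95, 0, 160, 10)),
    Token("Total:", (0, 20, 40, 30)),
    Token("$120", (45, 20, 80, 30)),
]
result = answer_question("What is the invoice date?", doc)
```

Because the final stage sees nothing outside `region`, the returned `evidence` scores and `region` box fully account for the answer, which is the property that makes such a chain checkable by a reader.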
