AVIR: Adaptive Visual In-Document Retrieval for Efficient Multi-Page Document Question Answering
Zongmin Li, Yachuan Li, Lei Kang, Dimosthenis Karatzas, Wenkang Ma

TL;DR
AVIR introduces an adaptive retrieval framework that efficiently selects relevant pages from multi-page documents, significantly reducing computational costs while maintaining high answer accuracy in visual question answering tasks.
Contribution
The paper proposes an adaptive retrieval method that improves efficiency and accuracy in multi-page document VQA without requiring model fine-tuning.
Findings
Reduces page usage by 70% on average.
Achieves an ANLS of 84.58% on MP-DocVQA.
Outperforms previous methods with lower computational cost.
Abstract
Multi-page Document Visual Question Answering (MP-DocVQA) remains challenging because long documents not only strain computational resources but also reduce the effectiveness of the attention mechanism in large vision-language models (LVLMs). We tackle these issues with an Adaptive Visual In-document Retrieval (AVIR) framework. A lightweight retrieval model first scores each page for question relevance. Pages are then clustered according to the score distribution to adaptively select relevant content. The clustered pages are screened again by Top-K to keep the context compact. However, for short documents, clustering reliability decreases, so we use a relevance probability threshold to select pages. The selected pages alone are fed to a frozen LVLM for answer generation, eliminating the need for model fine-tuning. The proposed AVIR framework reduces the average page count required for…
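The selection logic described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not say which clustering algorithm is used, so the sketch substitutes a simple largest-gap split of the sorted relevance scores. The function name `select_pages` and the parameters `top_k`, `short_doc_len`, and `prob_threshold` (with their default values) are hypothetical. Relevance scores are assumed to already come from some lightweight retrieval model and to lie in [0, 1].

```python
from typing import List


def select_pages(
    scores: List[float],
    top_k: int = 3,               # hypothetical cap on pages passed to the LVLM
    short_doc_len: int = 4,       # hypothetical cutoff for "short" documents
    prob_threshold: float = 0.5,  # hypothetical relevance threshold
) -> List[int]:
    """Return indices of the selected pages, in document order."""
    n = len(scores)
    if n == 0:
        return []

    # Page indices sorted from most to least relevant.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)

    if n <= short_doc_len:
        # Short document: clustering is unreliable, so use a threshold.
        chosen = [i for i in ranked if scores[i] >= prob_threshold]
        if not chosen:
            # Assumption: fall back to the single best page.
            chosen = ranked[:1]
    else:
        # Stand-in for clustering: split the ranked scores at the
        # largest gap and keep the high-score group.
        gaps = [scores[ranked[j]] - scores[ranked[j + 1]] for j in range(n - 1)]
        cut = max(range(n - 1), key=lambda j: gaps[j])
        high_cluster = ranked[: cut + 1]
        # Screen the cluster again with Top-K to keep the context compact.
        chosen = high_cluster[:top_k]

    return sorted(chosen)


if __name__ == "__main__":
    page_scores = [0.05, 0.90, 0.10, 0.85, 0.08, 0.07]
    print(select_pages(page_scores))
```

Only the pages at the returned indices would then be passed to the frozen LVLM. A real system would likely tune the threshold and Top-K values on a validation set.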
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
