Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism
Lei Kang, Rubèn Tito, Ernest Valveny, Dimosthenis Karatzas

TL;DR
This paper introduces a novel multi-page Document VQA method using self-attention scoring and Pix2Struct, achieving state-of-the-art results efficiently without OCR and handling documents with hundreds of pages.
Contribution
The paper presents a new multi-page Document VQA approach that extends single-page models with a self-attention scoring mechanism, reducing resource demands and improving scalability.
Findings
Achieves state-of-the-art performance on multi-page Document VQA tasks.
Handles documents with nearly 800 pages efficiently.
Does not require OCR for effective question answering.
Abstract
Documents are 2-dimensional carriers of written communication, and as such their interpretation requires a multi-modal approach where textual and visual information are efficiently combined. Document Visual Question Answering (Document VQA), due to this multi-modal nature, has garnered significant interest from both the document understanding and natural language processing communities. The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle. They have to concatenate all pages into one large page for processing, demanding substantial GPU resources, even for evaluation. In this work, we propose a novel method and efficient training strategy for multi-page Document VQA tasks. In particular, we employ a visual-only document representation, leveraging the encoder from a document understanding model, Pix2Struct.…
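The abstract is truncated here, so the following is only a minimal sketch of what a self-attention scoring step over pages might look like, not the authors' implementation. It assumes each page has already been encoded into a fixed-size vector (in the paper this would come from the Pix2Struct encoder; here the vectors are random placeholders). The names `score_pages`, `select_page`, and all weight matrices are hypothetical and untrained.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score_pages(page_embeddings, w_q, w_k, w_v, w_score):
    """Return one relevance score per page.

    Assumption: single-head self-attention across page vectors, followed by
    a linear scoring head. The paper's actual module may differ.
    """
    dim = len(w_q)
    queries = [matvec(w_q, p) for p in page_embeddings]
    keys = [matvec(w_k, p) for p in page_embeddings]
    values = [matvec(w_v, p) for p in page_embeddings]
    scores = []
    for q in queries:
        # Attention weights of this page over all pages in the document.
        attn = softmax([dot(q, k) / math.sqrt(dim) for k in keys])
        # Context vector: attention-weighted sum of the value vectors.
        context = [sum(a * v[j] for a, v in zip(attn, values)) for j in range(dim)]
        scores.append(dot(w_score, context))
    return scores


def select_page(scores):
    """Index of the highest-scoring page."""
    return max(range(len(scores)), key=scores.__getitem__)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim, num_pages = 4, 5
    # Placeholder page embeddings; real ones would come from a visual encoder.
    pages = random_matrix(num_pages, dim, rng)
    w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))
    w_score = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    scores = score_pages(pages, w_q, w_k, w_v, w_score)
    print("selected page index:", select_page(scores))
```

Under this reading, only the selected page would be passed on to the answer decoder, which is one way a single-page model could be extended to long documents without concatenating every page into a single input.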
Taxonomy
Topics: Image Retrieval and Classification Techniques
