Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism

Lei Kang; Rubèn Tito; Ernest Valveny; Dimosthenis Karatzas

arXiv:2404.19024·cs.CV·May 1, 2024

TL;DR

This paper introduces a novel multi-page Document VQA method using self-attention scoring and Pix2Struct, achieving state-of-the-art results efficiently without OCR and handling documents with hundreds of pages.

Contribution

The paper presents a new multi-page Document VQA approach that extends single-page models with a self-attention scoring mechanism, reducing resource demands and improving scalability.
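As a rough illustration of what a self-attention scoring step over pages could look like, the sketch below lets one pooled feature vector per page attend to the others and then turns the result into a relevance score per page. This page does not describe the authors' architecture in detail, so everything here is an assumption made for illustration: the function name `score_pages`, the single attention head, the random weights, and the feature dimensions are all hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def score_pages(page_feats, w_q, w_k, w_v, w_out):
    """Hypothetical page scorer (illustrative only, not the authors' model).

    page_feats: (num_pages, dim) array, one pooled feature vector per page.
    Returns a (num_pages,) array of relevance scores that sum to 1.
    """
    q = page_feats @ w_q
    k = page_feats @ w_k
    v = page_feats @ w_v
    # Each page attends to every page, including itself.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (P, P)
    context = attn @ v                                       # (P, dim)
    logits = context @ w_out                                 # (P,)
    return softmax(logits)

# Toy inputs: 5 pages with 16-dimensional features (assumed sizes).
rng = np.random.default_rng(0)
num_pages, dim = 5, 16
page_feats = rng.normal(size=(num_pages, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
w_out = rng.normal(size=dim) * 0.1

scores = score_pages(page_feats, w_q, w_k, w_v, w_out)
best_page = int(np.argmax(scores))
print(scores, best_page)
```

In a pipeline of this shape, only the highest-scoring page would need to be passed to a single-page answering model, which is one way a scoring step could reduce resource demands.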

Findings

01. Achieves state-of-the-art performance on multi-page Document VQA tasks.

02. Handles documents with nearly 800 pages efficiently.

03. Does not require OCR for effective question answering.

Abstract

Documents are 2-dimensional carriers of written communication, and as such their interpretation requires a multi-modal approach where textual and visual information are efficiently combined. Document Visual Question Answering (Document VQA), due to this multi-modal nature, has garnered significant interest from both the document understanding and natural language processing communities. The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle. They have to concatenate all pages into one large page for processing, demanding substantial GPU resources, even for evaluation. In this work, we propose a novel method and efficient training strategy for multi-page Document VQA tasks. In particular, we employ a visual-only document representation, leveraging the encoder from a document understanding model, Pix2Struct.…
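To make the resource argument concrete, the short calculation below compares how many token-pair interactions full self-attention needs when all pages are concatenated into one sequence versus when each page is encoded on its own. It relies only on the general fact that self-attention cost grows quadratically with sequence length. The figure of 2048 tokens per page is an assumed value for illustration, and page-by-page encoding is used here purely as a point of comparison.

```python
def attention_pairs(num_pages, tokens_per_page):
    """Count token-pair interactions for full self-attention.

    tokens_per_page is an assumed value, not a figure from the paper.
    """
    # One long sequence containing every page.
    concatenated = (num_pages * tokens_per_page) ** 2
    # Each page attended to independently.
    per_page = num_pages * tokens_per_page ** 2
    return concatenated, per_page

concatenated, per_page = attention_pairs(num_pages=800, tokens_per_page=2048)
ratio = concatenated // per_page
print(concatenated, per_page, ratio)
```

Under these assumptions the concatenated sequence needs as many times more pairwise interactions as there are pages, so the gap widens as documents get longer.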

Code & Models

Repositories

leitro/selfattnscoring-mpdocvqa (PyTorch, official)

Taxonomy

Topics: Image Retrieval and Classification Techniques