Hierarchical multimodal transformers for Multi-Page DocVQA

Rubèn Tito; Dimosthenis Karatzas; Ernest Valveny

arXiv:2212.05935 · cs.CV · April 5, 2023 · 1 citation

TL;DR

This paper extends Document Visual Question Answering to multi-page documents by creating a new dataset and proposing a hierarchical transformer model, Hi-VT5, that effectively processes long documents and offers explainability.

Contribution

Introduces the MP-DocVQA dataset for multi-page DocVQA and proposes Hi-VT5, a hierarchical transformer model that handles long documents and provides explainability.

Findings

01 Hi-VT5 outperforms existing methods on MP-DocVQA.

02 The hierarchical approach effectively summarizes multi-page information.

03 The model can identify relevant pages as explanations.

Abstract

Document Visual Question Answering (DocVQA) refers to the task of answering questions from document images. Existing work on DocVQA only considers single-page documents. However, in real scenarios documents are mostly composed of multiple pages that should be processed together. In this work we extend DocVQA to the multi-page scenario. To that end, we first create a new dataset, MP-DocVQA, where questions are posed over multi-page documents instead of single pages. Second, we propose a new hierarchical method, Hi-VT5, based on the T5 architecture, that overcomes the limitations of current methods in processing long multi-page documents. The proposed method is based on a hierarchical transformer architecture in which the encoder summarizes the most relevant information of every page and the decoder then takes this summarized information to generate the final answer. Through extensive…
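
The encode-then-decode flow described in the abstract can be illustrated with a toy sketch. It is not the Hi-VT5 implementation: the single self-attention layer standing in for the page encoder, the random weights, the `summary_slots` embeddings, and all function names are assumptions made for illustration. The only idea taken from the abstract is that each page is encoded separately, a small summary is kept per page, and the decoder would receive the concatenated summaries.

```python
import numpy as np


def self_attention(x, wq, wk, wv):
    """Single-head self-attention; a toy stand-in for a full page encoder."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v


def summarize_page(summary_slots, question, page_tokens, params):
    """Encode one page and keep only the outputs at the summary positions."""
    num_slots = summary_slots.shape[0]
    # Hypothetical input layout: [summary slots | question | page tokens].
    sequence = np.concatenate([summary_slots, question, page_tokens], axis=0)
    encoded = self_attention(sequence, *params)
    return encoded[:num_slots]


def encode_document(summary_slots, question, pages, params):
    """Summarize every page independently, then concatenate the summaries."""
    per_page = [summarize_page(summary_slots, question, p, params) for p in pages]
    return np.concatenate(per_page, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, num_slots = 16, 4
    params = tuple(rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
    slots = rng.normal(size=(num_slots, d_model))
    question = rng.normal(size=(6, d_model))
    # Three pages of different lengths (50, 120 and 80 token embeddings).
    pages = [rng.normal(size=(n, d_model)) for n in (50, 120, 80)]
    decoder_input = encode_document(slots, question, pages, params)
    print(decoder_input.shape)  # (12, 16): 3 pages x 4 summary slots each
```

In this sketch the sequence handed to the decoder grows with the number of pages times the number of summary slots, not with the total number of page tokens, which is the property that makes a hierarchical design attractive for long documents.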

Code & Models

Repositories

rubenpt91/MP-DocVQA-Framework
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Inverse Square Root Schedule · Linear Layer · Dense Connections · Gated Linear Unit · Attention Dropout · Adafactor · Residual Connection