Hierarchical multimodal transformers for Multi-Page DocVQA
Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny

TL;DR
This paper extends Document Visual Question Answering to multi-page documents by creating a new dataset and proposing a hierarchical transformer model, Hi-VT5, that effectively processes long documents and offers explainability.
Contribution
Introduces the MP-DocVQA dataset for multi-page DocVQA and proposes Hi-VT5, a hierarchical transformer model that handles long documents and provides explainability.
Findings
Hi-VT5 outperforms existing methods on MP-DocVQA.
The hierarchical approach effectively summarizes multi-page information.
The model can identify relevant pages as explanations.
Abstract
Document Visual Question Answering (DocVQA) refers to the task of answering questions from document images. Existing work on DocVQA only considers single-page documents. However, in real scenarios documents are mostly composed of multiple pages that should be processed altogether. In this work we extend DocVQA to the multi-page scenario. For that, we first create a new dataset, MP-DocVQA, where questions are posed over multi-page documents instead of single pages. Second, we propose a new hierarchical method, Hi-VT5, based on the T5 architecture, that overcomes the limitations of current methods to process long multi-page documents. The proposed method is based on a hierarchical transformer architecture where the encoder summarizes the most relevant information of every page and then, the decoder takes this summarized information to generate the final answer. Through extensive…
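The hierarchical idea in the abstract (summarize each page, then decode from the summaries) can be illustrated with a toy sketch. This is not the authors' implementation: `summarize_page`, `build_decoder_input`, and the `num_summary` parameter are hypothetical names, and mean pooling stands in for the learned page encoder. The sketch only shows why the decoder input grows with the number of pages rather than with the total number of tokens.

```python
from typing import List

Vector = List[float]


def summarize_page(token_embeddings: List[Vector], num_summary: int) -> List[Vector]:
    """Placeholder page encoder (assumption): mean-pool the page's token
    embeddings into at most `num_summary` summary vectors."""
    if not token_embeddings:
        raise ValueError("page has no tokens")
    dim = len(token_embeddings[0])
    # Ceiling division: tokens per pooled group.
    chunk = -(-len(token_embeddings) // num_summary)
    summaries = []
    for start in range(0, len(token_embeddings), chunk):
        group = token_embeddings[start:start + chunk]
        summaries.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return summaries


def build_decoder_input(pages: List[List[Vector]], num_summary: int) -> List[Vector]:
    """Concatenate the per-page summaries; a decoder would attend to this
    short sequence instead of every token of every page."""
    decoder_input: List[Vector] = []
    for page in pages:
        decoder_input.extend(summarize_page(page, num_summary))
    return decoder_input


# Two toy pages with 100 and 250 tokens, embedding size 4.
pages = [[[1.0] * 4] * 100, [[2.0] * 4] * 250]
decoder_input = build_decoder_input(pages, num_summary=10)
total_tokens = sum(len(page) for page in pages)
print(total_tokens, len(decoder_input))
```

In this toy setting the 350 input tokens are reduced to 20 summary vectors, 10 per page, so adding pages lengthens the decoder input by a fixed amount per page.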
Code & Models
- 🤗 rubentito/bigbird-base-itc-mpdocvqa (model) · 6 dl
- 🤗 rubentito/longformer-base-mpdocvqa (model) · 2 dl
- 🤗 rubentito/t5-base-mpdocvqa (model) · 6 dl · ♡ 1
- 🤗 rubentito/layoutlmv3-base-mpdocvqa (model) · 153 dl · ♡ 10
- 🤗 rubentito/bert-large-mpdocvqa (model) · 5 dl · ♡ 1
- 🤗 rubentito/longt5-tglobal-base-mpdocvqa (model) · 1 dl
- 🤗 rubentito/hivt5-base-mpdocvqa (model) · 2 dl · ♡ 5
- 🤗 rubentito/vt5-base-spdocvqa (model) · 4 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Inverse Square Root Schedule · Linear Layer · Dense Connections · Gated Linear Unit · Attention Dropout · Adafactor · Residual Connection
