Multiple-Question Multiple-Answer Text-VQA
Peng Tang, Srikar Appalaraju, R. Manmatha, Yusheng Xie, Vijay Mahadevan

TL;DR
This paper introduces MQMA, a transformer-based model that efficiently answers multiple questions simultaneously from multi-modal content in text-VQA tasks, outperforming previous single-question methods.
Contribution
The paper proposes a novel encoder-decoder transformer architecture and a pre-training task for multi-question text-VQA, enabling simultaneous multi-question answering and achieving state-of-the-art results.
Findings
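+2.5% absolute improvement over the previous state of the art on OCR-VQA
+1.4% absolute improvement over the previous state of the art on TextVQA
+0.6% absolute improvement over the previous state of the art on ST-VQA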
Abstract
We present Multiple-Question Multiple-Answer (MQMA), a novel approach to do text-VQA in encoder-decoder transformer models. The text-VQA task requires a model to answer a question by understanding multi-modal content: text (typically from OCR) and an associated image. To the best of our knowledge, almost all previous approaches for text-VQA process a single question and its associated content to predict a single answer. In order to answer multiple questions from the same image, each question and content are fed into the model multiple times. In contrast, our proposed MQMA approach takes multiple questions and content as input at the encoder and predicts multiple answers at the decoder in an auto-regressive manner at the same time. We make several novel architectural modifications to standard encoder-decoder transformers to support MQMA. We also propose a novel MQMA denoising…
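The contrast the abstract draws between single-question and multi-question processing can be illustrated with a small sketch of how inputs and targets might be packed. This is not the authors' implementation: the separator tokens (`<q>`, `<a>`, `<ocr>`) and the helper names `pack_single`, `pack_multi`, and `unpack_answers` are hypothetical, and the image features that a real text-VQA model also consumes are omitted for brevity.

```python
from typing import List, Tuple

# Hypothetical separator tokens; the tokens used in the paper are not
# stated in this summary.
Q_SEP = "<q>"
A_SEP = "<a>"
CTX_SEP = "<ocr>"


def pack_single(question: str, ocr_tokens: List[str], answer: str) -> Tuple[str, str]:
    """Single-question baseline: one (source, target) pair per question,
    so the OCR content is re-encoded for every question."""
    source = f"{Q_SEP} {question} {CTX_SEP} {' '.join(ocr_tokens)}"
    return source, answer


def pack_multi(
    questions: List[str], ocr_tokens: List[str], answers: List[str]
) -> Tuple[str, str]:
    """Multi-question packing: all questions share one copy of the OCR
    content on the encoder side, and all answers form one decoder target."""
    if len(questions) != len(answers):
        raise ValueError("each question needs exactly one answer")
    question_part = " ".join(f"{Q_SEP} {q}" for q in questions)
    source = f"{question_part} {CTX_SEP} {' '.join(ocr_tokens)}"
    target = " ".join(f"{A_SEP} {a}" for a in answers)
    return source, target


def unpack_answers(decoded: str) -> List[str]:
    """Split a decoded multi-answer string back into individual answers."""
    return [part.strip() for part in decoded.split(A_SEP) if part.strip()]
```

Under these assumptions, answering N questions about one image takes a single encoder pass over the shared content rather than N passes, and the answers are recovered by splitting the decoded sequence on the answer separator.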
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: ALIGN
