Multiple-Question Multiple-Answer Text-VQA

Peng Tang; Srikar Appalaraju; R. Manmatha; Yusheng Xie; Vijay Mahadevan

arXiv:2311.08622·cs.CV·November 16, 2023·1 cites

TL;DR

This paper introduces MQMA, a transformer-based model that efficiently answers multiple questions simultaneously from multi-modal content in text-VQA tasks, outperforming previous single-question methods.

Contribution

The paper proposes a novel encoder-decoder transformer architecture and a pre-training task for multi-question text-VQA, enabling simultaneous multi-question answering and achieving state-of-the-art results.

Findings

01. +2.5% over the previous state of the art on the OCR-VQA dataset

02. +1.4% over the previous state of the art on the TextVQA dataset

03. +0.6% over the previous state of the art on the ST-VQA dataset

Abstract

We present Multiple-Question Multiple-Answer (MQMA), a novel approach to do text-VQA in encoder-decoder transformer models. The text-VQA task requires a model to answer a question by understanding multi-modal content: text (typically from OCR) and an associated image. To the best of our knowledge, almost all previous approaches for text-VQA process a single question and its associated content to predict a single answer. In order to answer multiple questions from the same image, each question and content are fed into the model multiple times. In contrast, our proposed MQMA approach takes multiple questions and content as input at the encoder and predicts multiple answers at the decoder in an auto-regressive manner at the same time. We make several novel architectural modifications to standard encoder-decoder transformers to support MQMA. We also propose a novel MQMA denoising…
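The contrast the abstract draws, one pass per question versus one pass for all questions, can be illustrated with a minimal sketch. It assumes a plain-text packing scheme: the separator strings `<q>` and `<a>`, the `questions:` / `context:` prefixes, and the helper names `pack_questions` and `unpack_answers` are all hypothetical and are not taken from the paper, which may encode its inputs and outputs quite differently.

```python
from typing import List

# Hypothetical separators; the paper's actual input/output format is not given here.
Q_SEP = " <q> "
A_SEP = "<a>"


def pack_questions(questions: List[str], ocr_tokens: List[str]) -> str:
    """Build one encoder input holding every question plus the shared OCR text."""
    question_part = Q_SEP.join(q.strip() for q in questions)
    return f"questions: {question_part} context: {' '.join(ocr_tokens)}"


def unpack_answers(decoded: str, num_questions: int) -> List[str]:
    """Split one decoded sequence back into one answer per question."""
    answers = [a.strip() for a in decoded.split(A_SEP)]
    # Trim or pad so the result always lines up with the questions.
    answers = answers[:num_questions]
    answers += [""] * (num_questions - len(answers))
    return answers


if __name__ == "__main__":
    questions = ["What is the title?", "Who is the author?"]
    ocr_tokens = ["Dune", "Frank", "Herbert"]
    # A single-question pipeline would build and run one input per question;
    # here all questions share a single encoder input.
    print(pack_questions(questions, ocr_tokens))
    # Stand-in for the decoder output of a real model.
    print(unpack_answers("Dune <a> Frank Herbert", len(questions)))
```

In this sketch the image and OCR content appear once in the input regardless of how many questions are asked, which is the efficiency argument the abstract makes against feeding the content into the model once per question.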

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: ALIGN