Making the V in Text-VQA Matter

Shamanthak Hegde; Soumya Jahagirdar; Shankar Gangisetty

arXiv:2308.00295 · cs.CV · August 2, 2023

TL;DR

This paper proposes a method that improves Text-VQA by learning visual features alongside OCR and question features, using the VQA dataset as external knowledge to reduce biased answer predictions.

Contribution

It introduces a novel approach that combines TextVQA and VQA datasets to better learn visual features, improving answer accuracy and contextual understanding in Text-VQA tasks.
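As a rough illustration of what training on a combined dataset can look like, the sketch below merges TextVQA samples with a sampled share of VQA samples. The function name `mix_datasets`, the `vqa_ratio` parameter, and the dictionary-based sample format are all assumptions made for this example; the summary above does not state how the paper actually mixes the two datasets.

```python
import random

def mix_datasets(textvqa, vqa, vqa_ratio=0.5, seed=0):
    """Return a shuffled list of TextVQA samples plus a sampled share of VQA samples.

    vqa_ratio is a hypothetical knob: the number of VQA samples added,
    expressed as a fraction of the TextVQA set size.
    """
    rng = random.Random(seed)
    # Never request more VQA samples than are available.
    n_vqa = min(len(vqa), int(len(textvqa) * vqa_ratio))
    # Tag each sample with its origin so the sources stay distinguishable.
    mixed = [dict(sample, source="textvqa") for sample in textvqa]
    mixed += [dict(sample, source="vqa") for sample in rng.sample(vqa, n_vqa)]
    rng.shuffle(mixed)
    return mixed
```

Keeping a `source` tag on each sample makes it possible to report accuracy separately per dataset after training.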

Findings

01. Enhanced model performance on multiple datasets

02. Improved correlation between image features and text

03. Reduction in biased answer predictions

Abstract

Text-based VQA aims at answering questions by reading the text present in images. Compared to the VQA task, it requires much more understanding of scene-text relationships. Recent studies have shown that the question-answer pairs in the dataset focus on the text present in the image, give less importance to visual features, and in some cases do not require understanding the image at all. Models trained on this dataset predict biased answers because they lack an understanding of visual context. For example, for questions like "What is written on the signboard?", the answer predicted by the model is always "STOP", which makes the model ignore the image. To address these issues, we propose a method to learn visual features (making V matter in TextVQA) along with the OCR features and question features, using the VQA dataset as external knowledge for Text-based VQA.…
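The bias described in the abstract can be made concrete with a predictor that never sees the image: it answers each question with the most frequent training answer for that question. The sketch below is an illustration only, not the paper's evaluation protocol; the function names `question_prior` and `blind_accuracy` and the `(question, answer)` tuple format are assumptions.

```python
from collections import Counter, defaultdict

def question_prior(train):
    """Map each normalized question string to its most frequent training answer."""
    counts = defaultdict(Counter)
    for question, answer in train:
        counts[question.lower().strip()][answer] += 1
    return {q: c.most_common(1)[0][0] for q, c in counts.items()}

def blind_accuracy(prior, test):
    """Accuracy of a predictor that looks only at the question, never the image."""
    hits = sum(prior.get(q.lower().strip()) == a for q, a in test)
    return hits / len(test) if test else 0.0
```

A high score for such an image-blind predictor would suggest that the dataset lets a model answer without using visual context, which is the behaviour the abstract sets out to address.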

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques