Making the V in Text-VQA Matter
Shamanthak Hegde, Soumya Jahagirdar, Shankar Gangisetty

TL;DR
This paper proposes a method that improves Text-VQA by learning visual features alongside OCR and question features, training on TextVQA together with the VQA dataset to strengthen visual understanding and reduce bias in answer prediction.
Contribution
It introduces a novel approach that combines TextVQA and VQA datasets to better learn visual features, improving answer accuracy and contextual understanding in Text-VQA tasks.
Findings
Enhanced model performance on multiple datasets
Improved correlation between image features and text
Reduction in biased answer predictions
Abstract
Text-based VQA aims at answering questions by reading the text present in images. Compared to the VQA task, it requires far more understanding of scene-text relationships. Recent studies have shown that the question-answer pairs in the dataset focus mostly on the text present in the image, that less importance is given to visual features, and that some questions do not require understanding the image at all. Models trained on this dataset predict biased answers because they lack an understanding of visual context. For example, for questions like "What is written on the signboard?", the model always predicts "STOP", which leads it to ignore the image. To address these issues, we propose a method to learn visual features (making V matter in TextVQA) along with the OCR features and question features, using the VQA dataset as external knowledge for Text-based VQA.…
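The abstract does not say how the TextVQA and VQA data are combined, so the following is only a minimal sketch of one plausible way to build a mixed training set. Everything here is an assumption made for illustration, not the authors' pipeline: the `Sample` record, the `combine_datasets` helper, the `vqa_ratio` knob, and the choice to give plain VQA samples an empty OCR token list are all hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical unified record for both datasets (not from the paper)."""
    question: str
    answer: str
    image_id: str
    # Plain VQA samples carry no scene text, so this list stays empty for them.
    ocr_tokens: list = field(default_factory=list)
    source: str = "textvqa"


def combine_datasets(textvqa, vqa, vqa_ratio=0.5, seed=0):
    """Mix all TextVQA samples with a random subsample of VQA samples.

    vqa_ratio is an assumed knob: the number of VQA samples drawn per
    TextVQA sample, capped by the size of the VQA list.
    """
    rng = random.Random(seed)
    n_vqa = min(len(vqa), int(len(textvqa) * vqa_ratio))
    mixed = list(textvqa) + rng.sample(list(vqa), n_vqa)
    rng.shuffle(mixed)  # interleave the two sources for training
    return mixed


if __name__ == "__main__":
    textvqa = [Sample("What is written on the signboard?", "stop", "t0", ["STOP"])]
    vqa = [
        Sample("What color is the signboard?", "red", "c0", source="vqa"),
        Sample("How many people are in the image?", "2", "c1", source="vqa"),
    ]
    for s in combine_datasets(textvqa, vqa, vqa_ratio=1.0):
        print(s.source, s.question, s.ocr_tokens)
```

The point of the mix is that VQA questions cannot be answered from OCR tokens alone, so a model trained on both sources has a reason to attend to visual features rather than defaulting to a frequent scene-text answer.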
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques
