Improving Visual Question Answering by Referring to Generated Paragraph Captions
Hyounghun Kim, Mohit Bansal

TL;DR
This paper introduces a combined Visual and Textual Question Answering (VTQA) model that takes both an image and its paragraph caption as input, and shows that automatically generated paragraph captions improve visual question answering accuracy.
Contribution
The paper proposes a Visual and Textual Question Answering (VTQA) model that fuses image and paragraph-caption information through multiple attention mechanisms, improving VQA accuracy over strong baselines on Visual Genome.
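To make the idea of fusing two modalities under question-guided attention concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: the dot-product attention, the summing of per-modality answer scores, and every name (`attend`, `answer_logits`, `fuse_and_answer`) are assumptions chosen for illustration, and the toy vectors stand in for learned image-region and caption-token features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, items):
    """Question-guided attention (assumed form): weight each item by
    softmax(query . item) and return the weighted sum plus the weights."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[d] for w, item in zip(weights, items))
              for d in range(dim)]
    return pooled, weights


def answer_logits(pooled, answer_embeddings):
    """Score each candidate answer against a pooled feature vector."""
    return [dot(pooled, a) for a in answer_embeddings]


def fuse_and_answer(question, image_regions, caption_tokens, answer_embeddings):
    """Hypothetical fusion: attend over each modality separately,
    score the answers per modality, then sum the two score vectors."""
    img_vec, _ = attend(question, image_regions)
    cap_vec, _ = attend(question, caption_tokens)
    img_scores = answer_logits(img_vec, answer_embeddings)
    cap_scores = answer_logits(cap_vec, answer_embeddings)
    fused = [i + c for i, c in zip(img_scores, cap_scores)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Toy 2-d features: the question aligns with the first region and first token.
question = [1.0, 0.0]
image_regions = [[1.0, 0.0], [0.0, 1.0]]
caption_tokens = [[2.0, 0.0], [0.0, 2.0]]
answers = [[1.0, 0.0], [0.0, 1.0]]
best, fused = fuse_and_answer(question, image_regions, caption_tokens, answers)
```

In this toy setup both modalities favor the first answer, so their summed scores agree; a trained model would learn the feature projections and the fusion weights instead of using fixed vectors.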
Findings
Paragraph captions improve VQA accuracy.
Automatically generated captions are effective for VQA.
The joint model outperforms strong baselines on Visual Genome.
Abstract
Paragraph-style image captions describe diverse aspects of an image as opposed to the more common single-sentence captions that only provide an abstract description of the image. These paragraph captions can hence contain substantial information of the image for tasks such as visual question answering. Moreover, this textual information is complementary with visual information present in the image because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can directly be matched with the textual question and copied into the textual answer (i.e., via easier modality match). Hence, we propose a combined Visual and Textual Question Answering (VTQA) model which takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
