TL;DR
This paper introduces R-VQA, a framework that learns visual relation facts and applies semantic attention over them to improve visual question answering, reporting state-of-the-art results.
Contribution
The paper proposes a framework that learns visual relation facts for VQA and uses them through semantic attention, and it builds a Relation-VQA (R-VQA) dataset from Visual Genome using a semantic similarity module.
Findings
Achieves state-of-the-art results on benchmark datasets.
Demonstrates the effectiveness of visual relation facts in VQA.
Shows benefits of semantic attention in integrating knowledge.
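As a rough illustration of what attention over relation facts can look like, the sketch below scores each fact embedding against a question embedding and takes a softmax-weighted average. This is a generic scaled dot-product soft attention, not the authors' architecture: the function `attend_over_facts`, the toy triples, and the 3-dimensional vectors are all hypothetical and chosen only for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend_over_facts(question_vec, fact_vecs):
    """Question-guided soft attention over relation-fact embeddings.

    Hypothetical helper: returns one weight per fact and the weighted
    average of the fact embeddings (the attended context vector).
    """
    dim = len(question_vec)
    # Scaled dot-product relevance of each fact to the question.
    scores = [dot(question_vec, f) / math.sqrt(dim) for f in fact_vecs]
    weights = softmax(scores)
    context = [
        sum(w * f[i] for w, f in zip(weights, fact_vecs))
        for i in range(dim)
    ]
    return weights, context


# Toy, made-up embeddings for a question and two (subject, relation, object) facts.
question = [1.0, 0.0, 1.0]
facts = {
    ("man", "riding", "horse"): [0.9, 0.1, 0.8],
    ("sky", "is", "blue"): [-0.5, 0.7, 0.0],
}

weights, context = attend_over_facts(question, list(facts.values()))
for triple, weight in zip(facts, weights):
    print(triple, round(weight, 3))
```

In a real model the embeddings would be learned and the context vector would be fused with image and question features before answer prediction; here the fact whose embedding aligns better with the question simply receives the larger weight.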
Abstract
Recently, Visual Question Answering (VQA) has emerged as one of the most significant tasks in multimodal learning as it requires understanding both visual and textual modalities. Existing methods mainly rely on extracting image and question features to learn their joint feature embedding via multimodal fusion or attention mechanisms. Some recent studies utilize external VQA-independent models to detect candidate entities or attributes in images, which serve as semantic knowledge complementary to the VQA task. However, these candidate entities or attributes might be unrelated to the VQA task and have limited semantic capacities. To better utilize semantic knowledge in images, we propose a novel framework to learn visual relation facts for VQA. Specifically, we build a Relation-VQA (R-VQA) dataset based on the Visual Genome dataset via a semantic similarity module, in which each data…
