Question-Driven Graph Fusion Network For Visual Question Answering
Yuxi Qian, Yuncong Hu, Ruonan Wang, Fangxiang Feng, Xiaojie Wang

TL;DR
The paper introduces QD-GFN, a VQA model that uses the question to guide the fusion of three relation graphs (semantic, spatial, and implicit) and filters out question-irrelevant objects, reducing the noise introduced by inaccurate object detection and text grounding.
Contribution
It proposes a question-driven graph fusion network with an object filtering mechanism, enhancing visual reasoning in VQA tasks.
Findings
Outperforms the prior state of the art on both the VQA 2.0 and VQA-CP v2 datasets.
Graph aggregation and object filtering significantly improve performance.
Effectively reduces irrelevant information from inaccurate object detection.
Abstract
Existing Visual Question Answering (VQA) models have explored various visual relationships between objects in the image to answer complex questions, which inevitably introduces irrelevant information caused by inaccurate object detection and text grounding. To address this problem, we propose a Question-Driven Graph Fusion Network (QD-GFN). It first models semantic, spatial, and implicit visual relations in images with three graph attention networks. Question information is then used to guide the aggregation of the three graphs. Finally, QD-GFN adopts an object filtering mechanism to remove question-irrelevant objects contained in the image. Experimental results demonstrate that QD-GFN outperforms the prior state of the art on both the VQA 2.0 and VQA-CP v2 datasets. Further analysis shows that both the novel graph aggregation method and object filtering mechanism play a…
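The pipeline the abstract describes (three relation graphs, question-guided aggregation, then object filtering) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the scoring functions, so the dot-product scoring, the softmax over graphs, the top-k filtering rule, and the names `fuse_graphs` and `filter_objects` are all assumptions made for illustration. The three input lists stand in for the per-object outputs of the three graph attention networks.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_graphs(question, graph_outputs):
    """Weight each graph's object features by its affinity to the question.

    Assumption: affinity = question . (mean object feature of that graph).
    """
    n_objects = len(graph_outputs[0])
    dim = len(graph_outputs[0][0])
    scores = []
    for feats in graph_outputs:
        mean = [sum(obj[d] for obj in feats) / n_objects for d in range(dim)]
        scores.append(dot(question, mean))
    weights = softmax(scores)  # one weight per graph, summing to 1
    fused = [
        [sum(w * g[i][d] for w, g in zip(weights, graph_outputs)) for d in range(dim)]
        for i in range(n_objects)
    ]
    return fused, weights


def filter_objects(question, fused, keep):
    """Keep the `keep` objects whose fused features score highest against the question.

    Assumption: a hard top-k cut; the paper's filtering rule may differ.
    """
    ranked = sorted(range(len(fused)), key=lambda i: dot(question, fused[i]), reverse=True)
    kept = sorted(ranked[:keep])  # restore original object order
    return kept, [fused[i] for i in kept]


# Toy example: 3 objects, 2-dimensional features (made-up numbers).
question = [1.0, 0.0]
semantic = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
spatial = [[0.0, 2.0], [0.0, 1.0], [0.0, 0.0]]
implicit = [[1.0, 1.0], [0.0, 0.0], [0.5, 0.5]]

fused, weights = fuse_graphs(question, [semantic, spatial, implicit])
kept, kept_feats = filter_objects(question, fused, keep=2)
```

In this toy input the semantic graph aligns best with the question, so it receives the largest fusion weight, and the object with no question-aligned component is the one filtered out. A trained model would learn these scores rather than use raw dot products.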
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
