Bilinear Graph Networks for Visual Question Answering
Dalu Guo, Chang Xu, Dacheng Tao

TL;DR
This paper introduces bilinear graph networks that model relationships between words and objects in visual question answering, enabling complex reasoning and achieving state-of-the-art accuracy on VQA v2.0.
Contribution
The paper proposes a novel bilinear graph network approach that models object and word relationships for improved reasoning in visual question answering.
Findings
Achieves 72.41% accuracy on VQA v2.0 test-std set.
Models relationships between words and objects to support reasoning over complex questions.
Outperforms previous state-of-the-art methods.
Abstract
This paper revisits bilinear attention networks for the visual question answering task from a graph perspective. Classical bilinear attention networks build a bilinear attention map to extract the joint representation of words in the question and objects in the image, but they do not fully explore the relationships between words that complex reasoning requires. In contrast, we develop bilinear graph networks to model the context of the joint embeddings of words and objects. Two kinds of graphs are investigated, namely the image-graph and the question-graph. The image-graph transfers features of the detected objects to their related query words, so that the output nodes carry both semantic and factual information. The question-graph exchanges information between these output nodes from the image-graph to amplify the implicit yet important relationships between objects. These two kinds of graphs cooperate…
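The two graphs described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function name `bilinear_graph_layer`, the low-rank projections, the scaled softmax attention, the element-wise product used to fuse a query node with its attended sources, and all dimensions are assumptions chosen only to show the flow of information: words query objects (image-graph), then the resulting joint nodes query one another (question-graph).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bilinear_graph_layer(queries, sources, W_q, W_s):
    """Hypothetical layer: each query node attends over all source nodes.

    The score q_i^T s_j, computed after two linear projections, is a
    low-rank bilinear form in the raw query and source features.
    """
    q = queries @ W_q                        # (n_queries, d)
    s = sources @ W_s                        # (n_sources, d)
    scores = q @ s.T / np.sqrt(q.shape[1])   # (n_queries, n_sources)
    attn = softmax(scores, axis=1)           # each row sums to 1
    # Assumed fusion: element-wise product of query and attended source.
    joint = q * (attn @ s)                   # (n_queries, d)
    return joint, attn

rng = np.random.default_rng(0)
n_words, n_objects, d_word, d_obj, d = 5, 7, 16, 32, 8

words = rng.normal(size=(n_words, d_word))     # stand-in question word features
objects = rng.normal(size=(n_objects, d_obj))  # stand-in detected object features

# Image-graph: object features flow to the words that query them.
W_q1 = rng.normal(size=(d_word, d))
W_s1 = rng.normal(size=(d_obj, d))
joint_nodes, attn_img = bilinear_graph_layer(words, objects, W_q1, W_s1)

# Question-graph: the joint nodes exchange information among themselves.
W_q2 = rng.normal(size=(d, d))
W_s2 = rng.normal(size=(d, d))
context_nodes, attn_q = bilinear_graph_layer(joint_nodes, joint_nodes, W_q2, W_s2)
```

In this sketch the image-graph attention map has one row per word and one column per object, while the question-graph attention map is square over the word-aligned joint nodes, which is where relationships between words can be expressed.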
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
