Bilinear Graph Networks for Visual Question Answering

Dalu Guo; Chang Xu; Dacheng Tao

arXiv:1907.09815 · cs.CV · February 4, 2020 · 6 cites

TL;DR

This paper introduces bilinear graph networks that model relationships between words and objects in visual question answering, enabling complex reasoning and achieving state-of-the-art accuracy on VQA v2.0.

Contribution

The paper proposes a novel bilinear graph network approach that models object and word relationships for improved reasoning in visual question answering.

Findings

01 Achieves 72.41% accuracy on the VQA v2.0 test-std set.

02 Effectively models complex question reasoning.

03 Outperforms previous state-of-the-art methods.

Abstract

This paper revisits bilinear attention networks for the visual question answering task from a graph perspective. Classical bilinear attention networks build a bilinear attention map to extract the joint representation of words in the question and objects in the image, but they do not fully explore the relationships between words that complex reasoning requires. In contrast, we develop bilinear graph networks to model the context of the joint embeddings of words and objects. Two kinds of graphs are investigated: the image-graph and the question-graph. The image-graph transfers features of the detected objects to their related query words, so that the output nodes carry both semantic and factual information. The question-graph exchanges information between these output nodes from the image-graph to amplify the implicit yet important relationships between objects. These two kinds of graphs cooperate…
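
The two-graph flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the low-rank bilinear score, the softmax normalization, the residual update, all dimensions, and every name below (`graph_layer`, `w_t`, `w_s`, `w_v`, and so on) are assumptions chosen for illustration. It uses plain Python lists so that it runs without third-party dependencies.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(scores):
    """Normalize each row so that it sums to 1."""
    out = []
    for row in scores:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def graph_layer(targets, sources, w_t, w_s, w_v):
    """Each target node gathers features from every source node.

    Edge weights come from a low-rank bilinear form,
    score[i][j] = targets[i] @ w_t @ w_s^T @ sources[j]  (assumed form).
    """
    t = matmul(targets, w_t)                      # (n_targets, rank)
    s = matmul(sources, w_s)                      # (n_sources, rank)
    s_transposed = [list(col) for col in zip(*s)]
    attn = softmax_rows(matmul(t, s_transposed))  # (n_targets, n_sources)
    messages = matmul(attn, matmul(sources, w_v))
    # Residual update (an assumption): keep the target's own features.
    updated = [[x + m for x, m in zip(t_row, m_row)]
               for t_row, m_row in zip(targets, messages)]
    return updated, attn


rng = random.Random(0)
n_words, n_objects, d_word, d_obj, rank = 4, 6, 8, 10, 5
words = random_matrix(n_words, d_word, rng)      # hypothetical word embeddings
objects = random_matrix(n_objects, d_obj, rng)   # hypothetical object features

# Image-graph: object features flow to the query words.
image_nodes, image_attn = graph_layer(
    words, objects,
    random_matrix(d_word, rank, rng),
    random_matrix(d_obj, rank, rng),
    random_matrix(d_obj, d_word, rng),
)

# Question-graph: the image-graph output nodes exchange information.
question_nodes, question_attn = graph_layer(
    image_nodes, image_nodes,
    random_matrix(d_word, rank, rng),
    random_matrix(d_word, rank, rng),
    random_matrix(d_word, d_word, rng),
)

print(len(image_attn), len(image_attn[0]))
print(len(question_attn), len(question_attn[0]))
```

In this sketch the image-graph attention has one row per word and one column per object, while the question-graph attention is square over the word-aligned output nodes. Stacking the two calls mirrors the order given in the abstract, where the question-graph operates on the output of the image-graph.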

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques