Question-Guided Hybrid Convolution for Visual Question Answering

Peng Gao; Pan Lu; Hongsheng Li; Shuang Li; Yikang Li; Steven Hoi; Xiaogang Wang

arXiv:1808.02632·cs.CV·August 9, 2018·22 cites

Open Access

TL;DR

This paper introduces a Question-Guided Hybrid Convolution network for VQA that captures textual-visual relationships early, reduces parameters with group convolution, and enhances existing methods for improved accuracy.

Contribution

The paper presents a novel question-guided hybrid convolution approach that effectively fuses textual and visual features with fewer parameters and complements existing VQA techniques.

Findings

01

Improves VQA accuracy on public datasets.

02

Reduces model parameters via group convolution.

03

Enhances performance when combined with bilinear pooling and attention methods.
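The parameter saving from group convolution follows from simple counting: a k x k convolution split into G groups has 1/G as many weights as the ungrouped layer. The sketch below illustrates that arithmetic only. The helper name `conv_weight_count` and the channel sizes (256 in, 256 out, 3 x 3 kernel, 8 groups) are illustrative assumptions, not values taken from the paper.

```python
def conv_weight_count(c_in, c_out, k, groups=1):
    """Count the weights (bias excluded) of a k x k convolution with `groups` groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # Each group maps c_in/groups input channels to c_out/groups output channels.
    per_group = (c_in // groups) * (c_out // groups) * k * k
    return per_group * groups


# Hypothetical layer sizes, chosen only to show the ratio.
full = conv_weight_count(256, 256, 3)
grouped = conv_weight_count(256, 256, 3, groups=8)
print(full, grouped, full // grouped)
```

When the kernels are predicted from the question rather than stored, the same ratio applies to the output size of the layer that predicts them, which is plausibly where the saving matters most.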

Abstract

In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse the high-level textual and visual features from the neural network and abandon the visual spatial information when learning multi-modal features. To address these problems, question-guided kernels generated from the input question are designed to convolve with visual features, capturing the textual and visual relationship at an early stage. The question-guided convolution tightly couples the textual and visual information, but it also introduces more parameters when learning kernels. We apply group convolution, which consists of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with…
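The hybrid idea described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' architecture: it assumes 1 x 1 kernels, a single linear layer (`predictor`) that maps the question vector to the question-dependent kernel weights, and fixed kernels for the question-independent groups. All function and variable names here are hypothetical.

```python
import random


def linear(vec, weight):
    """Multiply a weight matrix (one row per output) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def conv1x1(feature_map, kernel):
    """1x1 convolution. feature_map is [C_in][H][W]; kernel is [C_out][C_in]."""
    c_in = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    return [[[sum(kernel[o][c] * feature_map[c][y][x] for c in range(c_in))
              for x in range(w)] for y in range(h)] for o in range(len(kernel))]


def hybrid_group_conv(feature_map, question_vec, predictor, static_kernels, n_dynamic):
    """Group convolution whose first n_dynamic groups use kernels predicted
    from the question; the remaining groups use question-independent kernels."""
    n_groups = n_dynamic + len(static_kernels)
    c_g = len(feature_map) // n_groups          # channels per group
    flat = linear(question_vec, predictor)      # n_dynamic * c_g * c_g predicted weights
    out = []
    for g in range(n_groups):
        chunk = feature_map[g * c_g:(g + 1) * c_g]
        if g < n_dynamic:
            start = g * c_g * c_g
            kernel = [flat[start + o * c_g:start + (o + 1) * c_g] for o in range(c_g)]
        else:
            kernel = static_kernels[g - n_dynamic]
        out.extend(conv1x1(chunk, kernel))
    return out


# Toy setup: 4 channels on a 2x2 grid, one dynamic group and one static group.
rng = random.Random(0)
feature_map = [[[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)] for _ in range(4)]
predictor = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
static_kernels = [[[1.0, 0.0], [0.0, 1.0]]]

out_a = hybrid_group_conv(feature_map, [1.0, 0.0, 0.0], predictor, static_kernels, n_dynamic=1)
out_b = hybrid_group_conv(feature_map, [0.0, 1.0, 0.0], predictor, static_kernels, n_dynamic=1)
```

In this toy setup, changing the question changes only the output channels of the question-dependent group, while the question-independent group produces the same features for both questions. The spatial layout of the feature map is preserved throughout, which is the property the abstract contrasts with fusing only high-level features.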

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Convolution