Question-Guided Hybrid Convolution for Visual Question Answering
Peng Gao, Pan Lu, Hongsheng Li, Shuang Li, Yikang Li, Steven Hoi, Xiaogang Wang

TL;DR
This paper introduces a Question-Guided Hybrid Convolution network for VQA that captures textual-visual relationships early, reduces parameters with group convolution, and enhances existing methods for improved accuracy.
Contribution
The paper presents a novel question-guided hybrid convolution approach that effectively fuses textual and visual features with fewer parameters and complements existing VQA techniques.
Findings
Improves VQA accuracy on public datasets.
Reduces model parameters via group convolution.
Enhances performance when combined with bilinear pooling and attention methods.
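The parameter saving from group convolution can be checked with simple arithmetic. The sketch below counts the weights of a standard k x k convolution and of its grouped counterpart. The helper name `conv_weight_count` and the channel sizes are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def conv_weight_count(c_in, c_out, k, groups=1):
    """Number of weights in a k x k convolution, ignoring biases.

    With `groups` > 1, each group maps c_in/groups input channels to
    c_out/groups output channels, so the total shrinks by a factor of `groups`.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * (c_out // groups) * k * k * groups


# Hypothetical sizes (assumption, not the paper's configuration).
C_IN, C_OUT, K, GROUPS = 512, 512, 3, 8

dense = conv_weight_count(C_IN, C_OUT, K)
grouped = conv_weight_count(C_IN, C_OUT, K, groups=GROUPS)
print(dense, grouped, dense // grouped)
```

If kernels are predicted from the question rather than learned directly, the predictor has to output every kernel weight, so cutting the kernel size by the number of groups cuts the size of the predictor's output by the same factor.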
Abstract
In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse the high-level textual and visual features from the neural network and abandon the visual spatial information when learning multi-modal features. To address these problems, question-guided kernels generated from the input question are designed to convolve with visual features for capturing the textual and visual relationship in the early stage. The question-guided convolution can tightly couple the textual and visual information but also introduces more parameters when learning kernels. We apply the group convolution, which consists of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with…
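The idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' architecture. It assumes 1x1 kernels, a single linear layer (`w_pred`) that predicts the question-dependent kernels, and an arbitrary choice that the first `n_dep_groups` groups are question-dependent while the rest use fixed kernels. The function name, argument names, and tensor sizes are all hypothetical.

```python
import numpy as np


def hybrid_group_conv1x1(visual, question, w_indep, w_pred, n_dep_groups):
    """Grouped 1x1 convolution mixing predicted and fixed kernels.

    visual:   (C, H, W) visual feature map
    question: (D,) question embedding
    w_indep:  (G, Cg_out, Cg_in) question-independent kernels, one per group
    w_pred:   (n_dep_groups * Cg_out * Cg_in, D) linear kernel predictor (assumption)
    """
    n_groups, cg_out, cg_in = w_indep.shape
    c, h, w = visual.shape
    assert c == n_groups * cg_in, "channels must split evenly into groups"

    # Predict kernels for the question-dependent groups from the question.
    w_dep = (w_pred @ question).reshape(n_dep_groups, cg_out, cg_in)

    # Split channels into groups and flatten the spatial grid.
    grouped = visual.reshape(n_groups, cg_in, h * w)

    outputs = []
    for g in range(n_groups):
        kernel = w_dep[g] if g < n_dep_groups else w_indep[g]
        # A 1x1 convolution is a matrix product applied at every location.
        outputs.append(kernel @ grouped[g])
    return np.concatenate(outputs, axis=0).reshape(n_groups * cg_out, h, w)


# Toy sizes for illustration only.
rng = np.random.default_rng(0)
visual = rng.standard_normal((8, 4, 4))     # C=8, H=W=4
w_indep = rng.standard_normal((4, 3, 2))    # G=4, Cg_out=3, Cg_in=2
w_pred = rng.standard_normal((2 * 3 * 2, 5))  # 2 dependent groups, D=5

out_a = hybrid_group_conv1x1(visual, rng.standard_normal(5), w_indep, w_pred, 2)
out_b = hybrid_group_conv1x1(visual, rng.standard_normal(5), w_indep, w_pred, 2)
print(out_a.shape)
```

In this sketch the output keeps its spatial layout, and only the channels produced by the question-dependent groups change when the question changes. The channels from the question-independent groups are identical for both questions.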
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Convolution
