Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering

Zhou Yu; Jun Yu; Jianping Fan; Dacheng Tao

arXiv:1708.01471·cs.CV·August 7, 2017·101 cites

TL;DR

This paper introduces Multi-modal Factorized Bilinear (MFB) pooling and combines it with a co-attention mechanism for visual question answering (VQA), reporting state-of-the-art accuracy while avoiding the high-dimensional representations and heavy computation of earlier bilinear pooling models.

Contribution

The paper proposes MFB pooling for multi-modal feature fusion and a co-attention mechanism for fine-grained image and question representation, and integrates both into a unified deep network for VQA.

Findings

01 MFB outperforms other bilinear pooling methods in VQA.

02 The combined MFB and co-attention model achieves state-of-the-art accuracy.

03 The approach is computationally efficient for practical applications.

Abstract

Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both the visual content of images and the textual content of questions. The approaches used to represent the images and questions in a fine-grained manner and to fuse these multi-modal features play key roles in performance. Bilinear pooling based models have been shown to outperform traditional linear models for VQA, but their high-dimensional representations and high computational complexity may seriously limit their applicability in practice. For multi-modal feature fusion, here we develop a Multi-modal Factorized Bilinear (MFB) pooling approach to efficiently and effectively combine multi-modal features, which results in superior performance for VQA compared with other bilinear pooling approaches. For fine-grained image and question representation, we develop a…
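To make the fusion step in the abstract concrete, here is a minimal sketch of factorized bilinear pooling in plain Python. It assumes the commonly described formulation: project both features into a shared k*o-dimensional space, multiply element-wise, sum-pool over windows of size k, then apply power and L2 normalization. The function names, the dimensions (m, n, k, o), and the random projection matrices are illustrative assumptions, not values or code taken from the paper.

```python
import math
import random


def linear(weights, vec):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def mfb_pool(x, y, U, V, k):
    """Sketch of factorized bilinear pooling (hypothetical helper).

    x: image feature of length m; y: question feature of length n.
    U: (k*o x m) and V: (k*o x n) projection matrices (assumed shapes).
    Returns a fused, L2-normalized vector of length o.
    """
    px = linear(U, x)  # project image feature to k*o dimensions
    py = linear(V, y)  # project question feature to k*o dimensions
    joint = [a * b for a, b in zip(px, py)]  # element-wise product
    # Sum pooling over non-overlapping windows of size k gives o outputs.
    pooled = [sum(joint[i:i + k]) for i in range(0, len(joint), k)]
    # Power normalization (signed square root), then L2 normalization.
    powered = [math.copysign(math.sqrt(abs(z)), z) for z in pooled]
    norm = math.sqrt(sum(z * z for z in powered)) or 1.0
    return [z / norm for z in powered]


# Toy usage with assumed sizes: m=8 image dims, n=6 question dims,
# k=5 factors per output, o=4 output dims.
rng = random.Random(0)
m, n, k, o = 8, 6, 5, 4
U = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(k * o)]
V = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k * o)]
x = [rng.gauss(0, 1) for _ in range(m)]
y = [rng.gauss(0, 1) for _ in range(n)]
fused = mfb_pool(x, y, U, V, k)
print(len(fused))
```

The point of the factorization is that the fused vector has only o dimensions and the parameters are two thin projection matrices, rather than a full bilinear tensor over every pair of image and question dimensions. In a real model the projections would be learned layers in a deep learning framework rather than random lists.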

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning