Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
Zhou Yu, Jun Yu, Jianping Fan, Dacheng Tao

TL;DR
This paper introduces a novel multi-modal factorized bilinear pooling method combined with co-attention mechanisms to improve visual question answering performance, achieving state-of-the-art results efficiently.
Contribution
It proposes a new MFB pooling technique and a co-attention mechanism integrated into a unified deep network for enhanced VQA performance.
Findings
MFB outperforms other bilinear pooling methods in VQA.
The combined MFB and co-attention model achieves state-of-the-art accuracy.
The approach is computationally efficient for practical applications.
Abstract
Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both the visual content of images and the textual content of questions. The approaches used to represent the images and questions in a fine-grained manner and questions and to fuse these multi-modal features play key roles in performance. Bilinear pooling based models have been shown to outperform traditional linear models for VQA, but their high-dimensional representations and high computational complexity may seriously limit their applicability in practice. For multi-modal feature fusion, here we develop a Multi-modal Factorized Bilinear (MFB) pooling approach to efficiently and effectively combine multi-modal features, which results in superior performance for VQA compared with other bilinear pooling approaches. For fine-grained image and question representation, we develop a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
