Beyond Bilinear: Generalized Multimodal Factorized High-order Pooling for Visual Question Answering

Zhou Yu; Jun Yu; Chenchao Xiang; Jianping Fan; Dacheng Tao

arXiv:1708.03619·cs.CV·May 17, 2019

TL;DR

This paper introduces MFH, a multimodal fusion method for visual question answering. It combines MFH with co-attention for feature extraction and a KL-divergence loss for answer prediction, and reports state-of-the-art results.

Contribution

It proposes a generalized high-order pooling approach for better multimodal feature fusion and integrates co-attention and KL divergence into a unified VQA model.
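As a rough illustration of what factorized high-order pooling can look like, the sketch below projects an image feature `x` and a question feature `y` into a shared expanded space, multiplies them element-wise, and sum-pools over windows of size `k`. Cascading several such blocks, each multiplied by the previous block's expanded output, raises the order of the interaction. All names (`project`, `sum_pool`, `mfh`), the tiny fixed weights, and the placement of the power and L2 normalization are illustrative assumptions, and dropout is omitted; this is not the authors' implementation.

```python
import math

def project(W, v):
    # Linear projection W v, with W given as a list of rows.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sum_pool(v, k):
    # Sum over non-overlapping windows of size k.
    return [sum(v[i:i + k]) for i in range(0, len(v), k)]

def normalize(z):
    # Signed square root (power normalization) followed by L2 normalization.
    z = [math.copysign(math.sqrt(abs(a)), a) for a in z]
    norm = math.sqrt(sum(a * a for a in z))
    return [a / norm for a in z] if norm > 0 else z

def mfh(x, y, blocks, k):
    # blocks: list of (U, V) weight pairs, one pair per cascaded block.
    prev = None
    out = []
    for U, V in blocks:
        expanded = [a * b for a, b in zip(project(U, x), project(V, y))]
        if prev is not None:
            # Higher-order term: reuse the previous block's expanded features.
            expanded = [a * b for a, b in zip(expanded, prev)]
        prev = expanded
        out.extend(normalize(sum_pool(expanded, k)))
    return out

# Toy inputs (assumed): 2-d image feature, 1-d question feature, k = 2, output size 2.
x = [1.0, 2.0]
y = [3.0]
U = [[1, 0], [0, 1], [1, 1], [1, -1]]
V = [[1], [1], [1], [1]]

one_block = mfh(x, y, [(U, V)], k=2)            # bilinear case, 2 outputs
two_blocks = mfh(x, y, [(U, V), (U, V)], k=2)   # cascaded case, 4 outputs
```

With a single block the function reduces to the bilinear case; each extra block appends another pooled vector, so the output length grows linearly with the number of blocks.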

Findings

01

Achieved state-of-the-art performance on large-scale VQA datasets.

02

Developed a generalized multi-modal fusion method (MFH).

03

Faster convergence and improved accuracy in answer prediction.

Abstract

Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both visual content of images and textual content of questions. To support the VQA task, we need to find good solutions for the following three issues: 1) fine-grained feature representations for both the image and the question; 2) multi-modal feature fusion that is able to capture the complex interactions between multi-modal features; 3) automatic answer prediction that is able to consider the complex correlations between multiple diverse answers for the same question. For fine-grained image and question representations, a "co-attention" mechanism is developed by using a deep neural network architecture to jointly learn the attentions for both the image and the question, which can allow us to reduce the irrelevant features effectively and obtain more discriminative features for image and…
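The third issue above, handling several diverse answers to the same question, is commonly addressed by treating the human answers as a distribution and minimizing the KL divergence between it and the model's predicted distribution. The following is a minimal sketch of that idea in plain Python; the function names, the three-word answer vocabulary, and the example logits are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def answer_distribution(human_answers, vocab):
    # Turn the annotators' answers into a probability vector over the vocabulary.
    counts = Counter(a for a in human_answers if a in vocab)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[a] / total for a in vocab]

def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    # KL(target || predicted); terms with zero target mass contribute nothing.
    return sum(t * math.log(t / max(p, eps))
               for t, p in zip(target, predicted) if t > 0)

# Hypothetical example: 10 annotators, 7 answered "yes" and 3 answered "no".
vocab = ["yes", "no", "2"]
target = answer_distribution(["yes"] * 7 + ["no"] * 3, vocab)
predicted = softmax([2.0, 1.0, -1.0])
loss = kl_divergence(target, predicted)
```

Unlike a cross-entropy loss against a single sampled answer, this target keeps the full spread of annotator answers, so the loss is zero only when the prediction matches the whole distribution.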
