TL;DR
This paper introduces MFH, a multimodal fusion method for visual question answering (VQA), and pairs it with co-attention for feature extraction and a KL-divergence loss for answer prediction. The resulting model is reported to achieve state-of-the-art results on large-scale VQA datasets.
Contribution
The paper proposes a generalized high-order pooling approach (MFH) for multimodal feature fusion and integrates it with co-attention and KL divergence in a unified VQA model.
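To make the idea of high-order pooling concrete, here is a minimal NumPy sketch of one common way to build it: each block projects the image and question features, multiplies them element-wise, and sum-pools the result, and several such blocks are cascaded and concatenated. This summary does not spell out the exact formulation, so the function names, dimensions, normalization steps, and random weights below are illustrative assumptions, not the authors' implementation (dropout and learned parameters are omitted).

```python
import numpy as np

def expand(x, y, U, V):
    """Project both modalities and fuse them element-wise (assumed 'expand' stage)."""
    return (x @ U) * (y @ V)                      # shape: (o * k,)

def sum_pool(z_exp, k):
    """Sum non-overlapping windows of size k (assumed 'squeeze' stage)."""
    return z_exp.reshape(-1, k).sum(axis=1)       # shape: (o,)

def normalize(z, eps=1e-12):
    """Signed square-root normalization followed by L2 normalization."""
    z = np.sign(z) * np.sqrt(np.abs(z))
    return z / (np.linalg.norm(z) + eps)

def high_order_pool(x, y, Us, Vs, k):
    """Cascade several expand/squeeze blocks and concatenate their outputs."""
    carry = np.ones(Us[0].shape[1])               # running element-wise product
    outputs = []
    for U, V in zip(Us, Vs):
        carry = carry * expand(x, y, U, V)        # feed the previous block's product forward
        outputs.append(normalize(sum_pool(carry, k)))
    return np.concatenate(outputs)

# Hypothetical sizes: 8-d image feature, 6-d question feature,
# output size o = 4 per block, pooling window k = 3, p = 2 cascaded blocks.
rng = np.random.default_rng(0)
m, n, o, k, p = 8, 6, 4, 3, 2
x, y = rng.normal(size=m), rng.normal(size=n)
Us = [rng.normal(size=(m, o * k)) for _ in range(p)]
Vs = [rng.normal(size=(n, o * k)) for _ in range(p)]
fused = high_order_pool(x, y, Us, Vs, k)          # shape: (p * o,)
```

With a single block (p = 1) this reduces to plain factorized bilinear pooling; adding blocks raises the order of the feature interactions at the cost of a longer fused vector.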
Findings
Achieves state-of-the-art performance on large-scale VQA datasets.
Introduces MFH, a generalized multimodal fusion method.
Using KL divergence for answer prediction yields faster convergence and improved accuracy.
Abstract
Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both visual content of images and textual content of questions. To support the VQA task, we need to find good solutions for the following three issues: 1) fine-grained feature representations for both the image and the question; 2) multi-modal feature fusion that is able to capture the complex interactions between multi-modal features; 3) automatic answer prediction that is able to consider the complex correlations between multiple diverse answers for the same question. For fine-grained image and question representations, a "co-attention" mechanism is developed by using a deep neural network architecture to jointly learn the attentions for both the image and the question, which can allow us to reduce the irrelevant features effectively and obtain more discriminative features for image and…
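The third issue, answer prediction over multiple diverse answers to the same question, can be illustrated with a small sketch of a KL-divergence loss. It assumes the target is the normalized distribution of annotator answers and the prediction is a softmax over candidate answers; the answer counts, logits, and function names are hypothetical and only show the general technique, not the paper's training code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions.

    Entries where target is zero contribute nothing, so they are masked out.
    """
    mask = target > 0
    ratio = target[mask] / (predicted[mask] + eps)
    return float(np.sum(target[mask] * np.log(ratio)))

# Hypothetical example: 10 annotators answered one question,
# spread across four candidate answers.
answer_counts = np.array([6.0, 3.0, 1.0, 0.0])
target = answer_counts / answer_counts.sum()      # soft label, sums to 1

logits = np.array([2.0, 1.0, 0.0, -1.0])          # made-up model scores
predicted = softmax(logits)

loss = kl_divergence(target, predicted)
```

Unlike a one-hot target, the soft target lets the model receive credit for every answer that annotators actually gave, in proportion to how often it was given.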
