Bilinear Attention Networks

Jin-Hwa Kim; Jaehyun Jun; Byoung-Tak Zhang

arXiv:1805.07932·cs.CV·October 22, 2018·78 cites

TL;DR

Bilinear Attention Networks (BAN) model interactions between visual and language inputs with bilinear attention, improving performance on visual question answering (VQA 2.0) and visual grounding (Flickr30k Entities).

Contribution

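BAN introduces bilinear attention, which learns one attention distribution over every pair of input channels from the two modalities, and uses low-rank bilinear pooling to extract the joint representation for each pair. Co-attention, by contrast, builds a separate attention distribution per modality and neglects the interaction between them.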

Findings

01 BAN achieves state-of-the-art results on VQA 2.0.

02 BAN outperforms previous methods on Flickr30k Entities.

03 The proposed model demonstrates superior multimodal interaction modeling.

Abstract

Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost to learn attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, co-attention builds two separate attention distributions for each modality neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions among two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit eight-attention maps of the BAN efficiently. We quantitatively and qualitatively evaluate…
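The mechanism the abstract describes can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `bilinear_attention`, the shapes, the rank `K`, and the random inputs are all hypothetical; the sketch omits nonlinearities and reuses one pair of projections for both the attention map and the pooling; and it computes a single attention map rather than the eight the abstract mentions.

```python
import numpy as np

def softmax_over_pairs(logits):
    # Normalize over ALL (i, j) pairs at once, so the result is a single
    # distribution over question-channel / image-channel pairs.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def bilinear_attention(X, Y, U, V, p):
    """Sketch of one bilinear attention map with low-rank pooling.

    X: (n_x, d_x) channels of one modality (e.g. question tokens)
    Y: (n_y, d_y) channels of the other modality (e.g. image regions)
    U: (d_x, K), V: (d_y, K) low-rank projections (hypothetical shapes)
    p: (K,) weights that score each rank-K interaction
    """
    XU = X @ U                      # (n_x, K)
    YV = Y @ V                      # (n_y, K)
    logits = (XU * p) @ YV.T        # (n_x, n_y): one score per channel pair
    A = softmax_over_pairs(logits)  # bilinear attention distribution
    # Joint feature: attention-weighted sum of elementwise products
    # of every projected channel pair.
    joint = np.einsum("ik,ij,jk->k", XU, A, YV)
    return A, joint

# Toy inputs (random, for shape checking only).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
Y = rng.normal(size=(6, 7))
U = rng.normal(size=(5, 3))
V = rng.normal(size=(7, 3))
p = rng.normal(size=3)

A, joint = bilinear_attention(X, Y, U, V, p)
print(A.shape, joint.shape)
```

The attention map `A` has one entry per pair of channels, which is what distinguishes it from co-attention's two per-modality vectors. The low-rank projections `U` and `V` keep the pairwise scoring at a cost of two small matrix products plus one `(n_x, K) x (K, n_y)` product, instead of a full `d_x x d_y` bilinear weight per output feature.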

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition