Multimodal Unified Attention Networks for Vision-and-Language Interactions

Zhou Yu; Yuhao Cui; Jun Yu; Dacheng Tao; Qi Tian

arXiv:1908.04107 · cs.CV · August 20, 2019 · 33 cites

Open Access

TL;DR

This paper introduces MUAN, a deep neural network with unified attention blocks that simultaneously models intra- and inter-modal interactions for improved vision-and-language understanding.

Contribution

It proposes a novel unified attention mechanism that captures both intra- and inter-modal interactions, enhancing multimodal feature representation.

Findings

1. MUAN achieves top-level performance on VQA datasets.
2. MUAN performs well on multiple visual grounding datasets.
3. Unified attention improves multimodal interaction modeling.

Abstract

Learning an effective attention mechanism for multimodal data is important in many vision-and-language tasks that require a synergic understanding of both the visual and textual contents. Existing state-of-the-art approaches use co-attention models to associate each visual object (e.g., image region) with each textual object (e.g., query word). Despite the success of these co-attention models, they only model inter-modal interactions while neglecting intra-modal interactions. Here we propose a general "unified attention" model that simultaneously captures the intra- and inter-modal interactions of multimodal features and outputs their corresponding attended representations. By stacking such unified attention blocks in depth, we obtain the deep Multimodal Unified Attention Network (MUAN), which can seamlessly be applied to the visual question answering (VQA) and visual grounding tasks.…
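
The abstract's contrast between co-attention (inter-modal only) and unified attention (intra- and inter-modal together) can be illustrated with a small sketch. This is not the authors' implementation: it assumes that one simple way to cover both kinds of interaction is to concatenate the visual and textual feature sequences and run plain scaled dot-product self-attention over the joint sequence, with no learned projections. The names `unified_attention` and `block_masses` and the toy feature vectors are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def unified_attention(visual, textual):
    """Self-attention over the concatenated visual + textual sequence.

    Assumption for illustration only: queries, keys, and values are the raw
    features themselves (no learned weights). Every element attends to every
    other element, so same-modality and cross-modality pairs are scored in
    a single attention map.
    """
    x = visual + textual              # joint sequence of feature vectors
    d = len(x[0])                     # feature dimension
    scale = math.sqrt(d)
    weights = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        weights.append(softmax(scores))
    # Each attended vector is a weighted sum over the whole joint sequence.
    attended = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return attended, weights


def block_masses(weights, n_visual):
    """Per row, the attention mass placed on visual vs. textual elements."""
    return [(sum(row[:n_visual]), sum(row[n_visual:])) for row in weights]


# Toy inputs (hypothetical): 3 image-region vectors and 2 word vectors, d = 4.
visual = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
textual = [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]]

attended, weights = unified_attention(visual, textual)
masses = block_masses(weights, len(visual))
```

In the resulting 5 x 5 map, entries whose row and column fall in the same modality are intra-modal interactions, and the remaining entries are inter-modal. A co-attention model in the sense of the abstract would use only the latter.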

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques