Multimodal Unified Attention Networks for Vision-and-Language Interactions
Zhou Yu, Yuhao Cui, Jun Yu, Dacheng Tao, Qi Tian

TL;DR
This paper introduces MUAN, a deep neural network with unified attention blocks that simultaneously models intra- and inter-modal interactions for improved vision-and-language understanding.
Contribution
It proposes a novel unified attention mechanism that captures both intra- and inter-modal interactions, enhancing multimodal feature representation.
Findings
MUAN achieves top-level performance on VQA datasets.
MUAN performs well on multiple visual grounding datasets.
Unified attention improves multimodal interaction modeling.
Abstract
Learning an effective attention mechanism for multimodal data is important in many vision-and-language tasks that require a synergic understanding of both the visual and textual contents. Existing state-of-the-art approaches use co-attention models to associate each visual object (e.g., image region) with each textual object (e.g., query word). Despite the success of these co-attention models, they only model inter-modal interactions while neglecting intra-modal interactions. Here we propose a general "unified attention" model that simultaneously captures the intra- and inter-modal interactions of multimodal features and outputs their corresponding attended representations. By stacking such unified attention blocks in depth, we obtain the deep Multimodal Unified Attention Network (MUAN), which can seamlessly be applied to the visual question answering (VQA) and visual grounding tasks.…
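To make the idea in the abstract concrete, here is a minimal NumPy sketch of one way a unified attention step could look. It assumes the visual and textual features share a dimension, are concatenated into a single sequence, and are passed through plain scaled dot-product self-attention, so that one attention map covers region-region, word-word, and region-word pairs at once. The function name `unified_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the residual update, and the weight sharing across stacked blocks are all illustrative assumptions; the abstract does not specify the block's internals, and this should not be read as the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def unified_attention(visual, textual, w_q, w_k, w_v):
    """One hypothetical unified attention step.

    visual:  (m, d) array of image-region features
    textual: (n, d) array of query-word features
    Returns attended visual (m, d), attended textual (n, d),
    and the joint attention map of shape (m + n, m + n).
    """
    # Assumption: both modalities are concatenated into one sequence,
    # so a single attention map holds intra- and inter-modal weights.
    joint = np.concatenate([visual, textual], axis=0)
    q, k, v = joint @ w_q, joint @ w_k, joint @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)
    attended = attn @ v
    m = visual.shape[0]
    return attended[:m], attended[m:], attn


rng = np.random.default_rng(0)
m, n, d = 4, 3, 8  # 4 regions, 3 words, feature size 8 (toy values)
visual = rng.normal(size=(m, d))
textual = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

# Stack two blocks in depth. The residual update and the reuse of the
# same weights in both blocks are simplifications for brevity.
v_feat, t_feat = visual, textual
for _ in range(2):
    att_v, att_t, attn = unified_attention(v_feat, t_feat, w_q, w_k, w_v)
    v_feat, t_feat = v_feat + att_v, t_feat + att_t
```

In the joint map `attn`, the top-left `m x m` and bottom-right `n x n` blocks hold the intra-modal weights, while the two off-diagonal blocks hold the inter-modal weights. A co-attention model, as characterized in the abstract, would use only the off-diagonal blocks.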
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
