DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

Fenglin Liu; Xian Wu; Shen Ge; Xuancheng Ren; Wei Fan; Xu Sun; Yuexian Zou

arXiv:2210.16431·cs.CV·November 1, 2022

TL;DR

DiMBERT introduces a novel vision-language model with disentangled attention spaces and visual concepts, achieving state-of-the-art results across multiple tasks by explicitly separating vision and language representations.

Contribution

The paper proposes DiMBERT, a framework with separated attention spaces for vision and language, and incorporates visual concepts to improve cross-modal understanding and performance.

Findings

01 Sets a new state of the art on three vision-language tasks

02 Improves existing models by up to 5% when the DiM module is added

03 Demonstrates the effectiveness of visual concepts in bridging modalities

Abstract

Vision-and-language (V-L) tasks require the system to understand both vision content and natural language, so learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance. Recently, various pre-trained V-L models have been proposed to learn V-L representations, and they achieve improved results on many tasks. However, the mainstream models process both vision and language inputs with the same set of attention matrices. As a result, the generated V-L representations are entangled in one common latent space. To tackle this problem, we propose DiMBERT (short for Disentangled Multimodal-Attention BERT), a novel framework that applies separated attention spaces for vision and language, so that the representations of the different modalities can be disentangled explicitly. To enhance the correlation between vision and language in…
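
The idea of separated attention spaces can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single attention head in which each token is projected to query, key, and value vectors by the matrices of its own modality, after which attention runs over the joint sequence. The names `init_params`, `project`, and `disentangled_attention` are hypothetical, as are the dimensions and the random inputs.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d_model, rng):
    # Assumption: one (W_q, W_k, W_v) triple per modality, instead of a
    # single triple shared by vision and language tokens.
    scale = 1.0 / np.sqrt(d_model)
    return {
        modality: {
            name: rng.normal(0.0, scale, (d_model, d_model))
            for name in ("q", "k", "v")
        }
        for modality in ("vision", "language")
    }


def project(x, is_vision, params, name):
    # Project every token with both matrices, then keep, per token,
    # the result that belongs to that token's modality.
    as_vision = x @ params["vision"][name]
    as_language = x @ params["language"][name]
    return np.where(is_vision[:, None], as_vision, as_language)


def disentangled_attention(x, is_vision, params):
    # x: (n_tokens, d_model); is_vision: (n_tokens,) boolean mask.
    q = project(x, is_vision, params, "q")
    k = project(x, is_vision, params, "k")
    v = project(x, is_vision, params, "v")
    scores = q @ k.T / np.sqrt(x.shape[-1])
    weights = softmax(scores, axis=-1)  # each token attends to all tokens
    return weights @ v, weights


# Hypothetical input: three image-region tokens followed by two word tokens.
rng = np.random.default_rng(0)
d_model = 8
params = init_params(d_model, rng)
x = rng.normal(size=(5, d_model))
is_vision = np.array([True, True, True, False, False])
out, weights = disentangled_attention(x, is_vision, params)
```

If the vision and language matrices are set to the same values, the sketch reduces to ordinary shared self-attention, which is the entangled setting the abstract describes.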

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling