Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning

Feilong Chen; Xiuyi Chen; Shuang Xu; Bo Xu

arXiv:2204.07302·cs.CV·April 18, 2022

TL;DR

This paper introduces ICMU, a contrastive learning method that improves cross-modal understanding in visual dialog models by distinguishing pulled inputs (pulled images, questions, or answers) and by exploiting single-turn visual question answering, leading to better performance on the VisDial dataset.

Contribution

The paper proposes ICMU, a contrastive learning approach built on the vision-language pre-training model VD-BERT. It enhances cross-modal understanding in visual dialog models by distinguishing pulled inputs and by using single-turn VQA to improve multi-turn conversations.

Findings

01  ICMU improves cross-modal understanding in visual dialog models.

02  The approach yields significant gains on the VisDial dataset.

03  Contrastive learning effectively distinguishes different pulled inputs.

Abstract

Visual Dialog is a challenging vision-language task because the visual dialog agent must answer a series of questions after reasoning over both the image content and the dialog history. Although existing methods try to address cross-modal understanding in visual dialog, they still fall short in ranking candidate answers based on their understanding of visual and textual contexts. In this paper, we analyze cross-modal understanding in visual dialog based on the vision-language pre-training model VD-BERT and propose a novel approach, named ICMU, to improve it. ICMU enhances cross-modal understanding by distinguishing different pulled inputs (i.e., pulled images, questions, or answers) based on four-way contrastive learning. In addition, ICMU exploits the single-turn visual question answering to enhance the visual dialog model's…
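As a rough illustration of the four-way contrastive idea mentioned in the abstract, the sketch below builds one matched input and three pulled variants from an (image, question, answer) triple, then scores a four-way prediction with softmax cross-entropy. This is one possible reading, not the authors' implementation: the label names, the `make_four_way_examples` and `cross_entropy` helpers, the way replacements are sampled, and the string placeholders standing in for image features and text are all assumptions made for illustration.

```python
import math
import random

# Hypothetical class names for the four-way objective (assumed, not from the paper).
LABELS = ("matched", "pulled_image", "pulled_question", "pulled_answer")


def make_four_way_examples(sample, pool, rng):
    """Build four (triple, label) pairs from one matched (image, question, answer) sample.

    The three pulled variants each swap exactly one element for the
    corresponding element of a different sample drawn from `pool`.
    """
    others = [s for s in pool if s is not sample]
    distractor = rng.choice(others)
    image, question, answer = sample
    return [
        ((image, question, answer), 0),          # original, matched input
        ((distractor[0], question, answer), 1),  # image replaced
        ((image, distractor[1], answer), 2),     # question replaced
        ((image, question, distractor[2]), 3),   # answer replaced
    ]


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed with the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


if __name__ == "__main__":
    # Toy data: strings stand in for image features and tokenized text.
    pool = [
        ("img_a", "what color is the cat?", "black"),
        ("img_b", "how many people are there?", "two"),
        ("img_c", "is it raining?", "no"),
    ]
    rng = random.Random(0)
    for triple, label in make_four_way_examples(pool[0], pool, rng):
        print(LABELS[label], triple)
    # Placeholder logits; a real model would produce these from the fused input.
    print(cross_entropy([2.0, 0.1, -0.3, 0.5], 0))
```

In a real system the four logits would come from a classification head on top of the vision-language encoder, and the loss would be averaged over a batch; the toy logits here only show the shape of the objective.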

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization