Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning
Feilong Chen, Xiuyi Chen, Shuang Xu, Bo Xu

TL;DR
This paper introduces ICMU, a method that improves cross-modal understanding in visual dialog models through four-way contrastive learning over pulled inputs (images, questions, or answers) and by leveraging single-turn visual question answering, leading to better performance on the VisDial dataset.
Contribution
The paper proposes a novel contrastive learning approach, ICMU, that enhances cross-modal understanding in visual dialog models by distinguishing pulled images, questions, and answers, and that uses single-turn VQA to improve multi-turn conversations.
Findings
ICMU improves cross-modal understanding in visual dialog models.
The approach yields significant gains on the VisDial dataset.
Contrastive learning effectively distinguishes different pulled inputs.
Abstract
Visual Dialog is a challenging vision-language task since the visual dialog agent needs to answer a series of questions after reasoning over both the image content and dialog history. Though existing methods attempt to address cross-modal understanding in visual dialog, they still fall short in ranking candidate answers based on their understanding of visual and textual contexts. In this paper, we analyze cross-modal understanding in visual dialog based on the vision-language pre-training model VD-BERT and propose a novel approach, named ICMU, to improve cross-modal understanding for visual dialog. ICMU enhances cross-modal understanding by distinguishing different pulled inputs (i.e. pulled images, questions or answers) based on four-way contrastive learning. In addition, ICMU exploits the single-turn visual question answering to enhance the visual dialog model's…
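The four-way contrastive idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that a "pulled" input is one whose image, question, or answer has been swapped in from another sample, and that the model assigns one score to each of the four variants, trained with a softmax cross-entropy in which the unmodified input is the positive. The function names, the dictionary fields, and the sampling rule are all hypothetical.

```python
import math
import random


def build_four_way_inputs(sample, pool, rng):
    """Return [original, pulled-image, pulled-question, pulled-answer].

    Assumption: a pulled field is replaced by the same field taken from
    a different sample in `pool` (hypothetical sampling rule).
    """
    variants = [dict(sample)]
    for field in ("image", "question", "answer"):
        candidates = [s[field] for s in pool if s[field] != sample[field]]
        pulled = dict(sample)
        pulled[field] = rng.choice(candidates)
        variants.append(pulled)
    return variants


def four_way_contrastive_loss(scores, positive_index=0):
    """Softmax cross-entropy over the four variant scores.

    Assumption: `scores` holds one scalar per variant, and the
    unmodified input sits at `positive_index`.
    """
    top = max(scores)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[positive_index]


# Toy usage with placeholder strings standing in for real features.
pool = [
    {"image": "img_a", "question": "what color is the cat?", "answer": "black"},
    {"image": "img_b", "question": "how many people?", "answer": "two"},
]
variants = build_four_way_inputs(pool[0], pool, random.Random(0))
loss = four_way_contrastive_loss([2.0, 0.5, 0.1, -0.3])
```

In this sketch the loss shrinks as the score of the unmodified input rises above the scores of the three pulled variants, which is one plausible way to push a model to tell matched image, question, and answer triples from mismatched ones.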
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
