TL;DR
ZRIGF is a novel multimodal framework that enhances zero-resource image-grounded dialogue generation by integrating visual and textual information through contrastive and generative pre-training, demonstrating strong generalization in unseen domains.
Contribution
The paper introduces ZRIGF, a two-stage learning framework combining contrastive and generative pre-training for effective zero-resource image-grounded dialogue generation.
Findings
ZRIGF outperforms baselines in generating relevant responses.
The framework generalizes robustly to unseen domains.
The proposed text-image matching and text-assisted masked image modeling modules achieve effective multimodal feature alignment.
Abstract
Image-grounded dialogue systems benefit greatly from integrating visual information, resulting in high-quality response generation. However, current models struggle to effectively utilize such information in zero-resource scenarios, mainly due to the disparity between image and text modalities. To overcome this challenge, we propose an innovative multimodal framework, called ZRIGF, which assimilates image-grounded information for dialogue generation in zero-resource situations. ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training. Contrastive pre-training includes a text-image matching module that maps images and texts into a unified encoded vector space, along with a text-assisted masked image modeling module that preserves pre-training visual features and fosters further multimodal feature alignment. Generative pre-training…
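To make the text-image matching idea concrete, the following is a minimal sketch of a symmetric contrastive objective over paired image and text embeddings. It is an illustration under stated assumptions, not the paper's implementation: the excerpt above does not specify the loss, so the InfoNCE-style formulation, the function names `l2_normalize` and `contrastive_loss`, and the `temperature` value of 0.07 are all hypothetical choices. Embeddings are assumed to be non-zero vectors of equal dimension, with row i of each list forming a matched pair.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed stand-in for text-image matching).

    image_embs[i] and text_embs[i] are treated as a matched pair; every other
    combination in the batch acts as a negative.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]

    # logits[i][j] is the temperature-scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    def matched_pair_nll(rows):
        # Mean negative log-probability of the diagonal (matched) entry per row,
        # computed with a numerically stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_norm - row[i]
        return total / len(rows)

    # Average the image-to-text and text-to-image directions.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (matched_pair_nll(logits) + matched_pair_nll(transposed))
```

In this sketch, a batch whose image and text embeddings line up pair by pair yields a loss near zero, while a batch whose texts are shifted relative to their images yields a larger loss. That gap is the signal that pulls matched pairs together in the shared vector space.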
