From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping
Junyang Wang, Ming Yan, Yi Zhang, Jitao Sang

TL;DR
This paper introduces Knight, a zero-shot method that uses unsupervised cross-modal mapping to project images and videos into the language modality, enabling caption generation with text-only training and no paired image-text or video-text data.
Contribution
It proposes a novel unsupervised cross-modal mapping technique, Knight, that bridges the modality gap for zero-shot image and video captioning.
Findings
Achieves state-of-the-art zero-shot captioning performance
Requires only unsupervised, text-only training
Effective for both image and video captioning
Abstract
With the development of Vision-Language Pre-training Models (VLPMs) such as CLIP and ALIGN, association-based visual tasks such as image classification and image-text retrieval have seen significant breakthroughs, driven by CLIP's zero-shot capability without fine-tuning. However, CLIP is hard to apply to generation-based tasks because it lacks a decoder architecture and generative pre-training tasks. Although previous works have given CLIP generation capacity through additional language models, a modality gap exists between the CLIP representations of different modalities, and CLIP cannot model the offset of this gap, so concepts fail to transfer across modalities. To solve this problem, we map images/videos to the language modality and generate captions from the language modality. In this paper, we propose the K-nearest-neighbor…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: ALIGN · Contrastive Language-Image Pre-training
