From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping

Junyang Wang; Ming Yan; Yi Zhang; Jitao Sang

arXiv:2304.13273·cs.CV·May 9, 2023·1 cites

TL;DR

This paper introduces Knight, a zero-shot captioning method that maps images and videos into the language modality through unsupervised cross-modal mapping, so captions can be generated after training on text alone.

Contribution

The paper proposes Knight, an unsupervised cross-modal mapping technique that bridges the modality gap for zero-shot image and video captioning.

Findings

01 Achieves state-of-the-art zero-shot captioning performance

02 Trains on text only, without supervision

03 Effective for both image and video captioning

Abstract

With the development of Vision-Language Pre-training Models (VLPMs) represented by CLIP and ALIGN, significant breakthroughs have been achieved on association-based visual tasks such as image classification and image-text retrieval, thanks to the zero-shot capability of CLIP without fine-tuning. However, CLIP is hard to apply to generation-based tasks because it lacks a decoder architecture and pre-training tasks for generation. Although previous works have created generation capacity for CLIP through additional language models, a modality gap exists between the CLIP representations of different modalities, and CLIP cannot model the offset of this gap, so concepts fail to transfer across modalities. To solve the problem, we try to map images/videos to the language modality and generate captions from the language modality. In this paper, we propose the K-nearest-neighbor…
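The abstract is cut off before it describes the method, so the following is only a minimal sketch of one generic way to map a visual embedding into the language modality with K nearest neighbors; it should not be read as the authors' implementation. It assumes that image and text embeddings live in a shared space (as with CLIP), that a bank of text embeddings is available, and that the mapped representation is a softmax-weighted mix of the K most similar text embeddings. The function name `knn_text_projection`, the parameters `k` and `temperature`, and the toy vectors are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def knn_text_projection(query, text_bank, k=2, temperature=0.1):
    """Replace a query embedding with a weighted mix of its k nearest
    text embeddings (hypothetical helper, not taken from the paper)."""
    q = normalize(query)
    bank = [normalize(t) for t in text_bank]

    # Cosine similarity between the query and every text embedding.
    sims = [sum(a * b for a, b in zip(q, t)) for t in bank]

    # Indices of the k most similar text embeddings, best first.
    top = sorted(range(len(bank)), key=lambda i: sims[i], reverse=True)[:k]

    # Softmax over the selected similarities (shifted for stability).
    best = max(sims[i] for i in top)
    exps = [math.exp((sims[i] - best) / temperature) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the selected text embeddings, renormalized.
    mixed = [
        sum(w * bank[i][d] for w, i in zip(weights, top))
        for d in range(len(q))
    ]
    return normalize(mixed), top


# Toy example: three "text" embeddings and one "image" embedding.
text_bank = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embedding = [0.9, 0.4, 0.0]
projected, neighbors = knn_text_projection(image_embedding, text_bank, k=2)
```

In this sketch the output is built only from text embeddings, so a decoder trained on text embeddings alone would receive inputs from the distribution it was trained on. Whether the paper weights or selects neighbors this way cannot be confirmed from the truncated abstract.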

Code & Models

Repositories

junyangwang0410/knight
jax · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: ALIGN · Contrastive Language-Image Pre-training