MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

Bang Yang, Fenglin Liu, Xian Wu, Yaowei Wang, Xu Sun, and Yuexian Zou

arXiv:2308.13218·cs.CV·August 28, 2023

TL;DR

MultiCapCLIP is a zero-shot multilingual visual captioning method that trains on text only, with no labeled image-caption pairs. It retrieves concept prompts and auto-encodes them, and it outperforms prior zero-shot and weakly-supervised methods across four benchmarks and four languages.

Contribution

The paper introduces a zero-shot approach to multilingual visual captioning that combines concept prompt retrieval with prompt auto-encoding and needs no labeled vision-caption datasets.

Findings

01. Achieves absolute improvements of 4.8% in BLEU@4 and 21.5% in CIDEr.

02. Effective across four languages and four benchmark datasets.

03. Outperforms state-of-the-art zero-shot and weakly-supervised methods.

Abstract

Supervised visual captioning models typically require a large scale of images or videos paired with descriptions in a specific language (i.e., the vision-caption pairs) for training. However, collecting and labeling large-scale datasets is time-consuming and expensive for many scenarios and languages. Therefore, sufficient labeled pairs are usually not available. To deal with the label shortage problem, we present a simple yet effective zero-shot approach MultiCapCLIP that can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets. In the training stage, MultiCapCLIP only requires text data for input. Then it conducts two main steps: 1) retrieving concept prompts that preserve the corresponding domain knowledge of new scenarios; 2) auto-encoding the prompts to learn writing styles to output captions in a desired…
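The first step described above, retrieving concept prompts, can be illustrated with a minimal sketch. It assumes the input and every candidate concept have already been embedded into a shared vector space, and it ranks concepts by cosine similarity. The toy 3-dimensional vectors, the concept names, the function names `cosine` and `retrieve_concept_prompts`, and the choice of `k` are all hypothetical and are not taken from the official repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def retrieve_concept_prompts(query, concept_bank, k=2):
    """Return the k concept names whose embeddings are closest to `query`."""
    ranked = sorted(
        concept_bank,
        key=lambda name: cosine(query, concept_bank[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings; a real system would use a pretrained encoder.
concept_bank = {
    "dog": [0.9, 0.1, 0.0],
    "frisbee": [0.7, 0.6, 0.1],
    "kitchen": [0.0, 0.2, 0.9],
}
query = [1.0, 0.3, 0.0]  # stand-in for an embedded sentence or image

prompts = retrieve_concept_prompts(query, concept_bank, k=2)
print(prompts)
```

In this sketch the retrieved concept names would then be passed, together with the input embedding, to a caption decoder. That decoder and the auto-encoding objective are not shown here.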

Code & Models

Repositories

yangbang18/multicapclip (PyTorch, official)
