TL;DR
MultiCapCLIP is a zero-shot multilingual visual captioning method that generates captions without labeled vision-caption pairs. It trains on text only, retrieving concept prompts and auto-encoding them, and reports gains on four benchmarks across four languages.
Contribution
It introduces a novel zero-shot approach that leverages prompt auto-encoding and concept retrieval for multilingual visual captioning without labeled datasets.
Findings
Improves BLEU@4 by 4.8% and CIDEr by 21.5%.
Effective across four languages and four benchmark datasets.
Outperforms state-of-the-art zero-shot and weakly-supervised methods.
Abstract
Supervised visual captioning models typically require a large scale of images or videos paired with descriptions in a specific language (i.e., the vision-caption pairs) for training. However, collecting and labeling large-scale datasets is time-consuming and expensive for many scenarios and languages. Therefore, sufficient labeled pairs are usually not available. To deal with the label shortage problem, we present a simple yet effective zero-shot approach MultiCapCLIP that can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets. In the training stage, MultiCapCLIP only requires text data for input. Then it conducts two main steps: 1) retrieving concept prompts that preserve the corresponding domain knowledge of new scenarios; 2) auto-encoding the prompts to learn writing styles to output captions in a desired…
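The first step in the abstract, retrieving concept prompts, can be illustrated with a small sketch. The abstract does not say how retrieval is performed, so this assumes a nearest-neighbour lookup by cosine similarity over embeddings in a shared space. The names `cosine`, `retrieve_concepts`, and `CONCEPT_BANK`, along with the toy 3-d vectors, are hypothetical and are not taken from the paper or its code.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_concepts(query: Sequence[float],
                      concept_bank: Dict[str, Sequence[float]],
                      k: int = 2) -> List[str]:
    """Return the k concept names whose embeddings are closest to the query."""
    ranked = sorted(
        concept_bank,
        key=lambda name: cosine(query, concept_bank[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-d embeddings standing in for real encoder outputs (assumed values).
CONCEPT_BANK = {
    "dog": [0.9, 0.1, 0.0],
    "grass": [0.6, 0.6, 0.1],
    "piano": [0.0, 0.1, 0.9],
}

# During text-only training the query would come from a text embedding;
# here it is just a hand-written vector.
query_embedding = [1.0, 0.3, 0.0]
print(retrieve_concepts(query_embedding, CONCEPT_BANK, k=2))  # ['dog', 'grass']
```

In this sketch the retrieved concept names would then serve as the prompt that the decoder conditions on. How the prompts are encoded and decoded in MultiCapCLIP itself is not described in the excerpt above.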
