ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Yoad Tewel; Yoav Shalev; Idan Schwartz; Lior Wolf

arXiv:2111.14447·cs.CV·April 1, 2022·6 cites

TL;DR

ZeroCap combines a contrastively trained image-text matching model with a large language model to generate descriptive captions for images at inference time, with no further training or tuning. Because the method is zero-shot, it also supports higher-level visual reasoning such as image arithmetic and visual analogies.

Contribution

The paper introduces a novel zero-shot image captioning approach that combines visual-semantic models with language models, enabling flexible image-to-text generation and visual arithmetic tasks.

Findings

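01 Generated captions are much less restrictive than those obtained by supervised captioning methods.

02 The method performs image arithmetic and visual analogies: inputs can be images or text, and the output is a sentence.

03 As a zero-shot method that needs no further training or tuning, it is highly flexible across image-to-text tasks.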

Abstract

Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate a descriptive text given an image at inference time, without any further training or tuning steps. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text, and the output is a sentence. This enables novel…
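To make the idea of image arithmetic with mixed image and text inputs more concrete, here is a minimal sketch in plain Python. It is not the paper's procedure: the 3-dimensional vectors, the names `image_king`, `text_man`, and `text_woman`, the helper functions, and the fixed list of candidate captions are all hypothetical stand-ins. A real system would take embeddings from a visual-semantic model and would generate the output sentence with a language model rather than pick it from a list. The sketch only shows the underlying arithmetic: add and subtract unit-normalized embeddings in a shared space, then score text against the result by cosine similarity.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def combine(terms):
    """Weighted sum of unit-normalized embeddings, renormalized.

    `terms` is a list of (weight, vector) pairs; a weight of -1.0
    subtracts that concept from the result.
    """
    dim = len(terms[0][1])
    total = [0.0] * dim
    for weight, vec in terms:
        unit = normalize(vec)
        for i in range(dim):
            total[i] += weight * unit[i]
    return normalize(total)


# Hypothetical toy embeddings in a shared image-text space.
# Axes (assumed for illustration): royalty, male, female.
embeddings = {
    "image_king": [1.0, 1.0, 0.0],  # stands in for an image embedding
    "text_man": [0.0, 1.0, 0.0],    # stands in for a text embedding
    "text_woman": [0.0, 0.0, 1.0],  # stands in for a text embedding
}

# Hypothetical candidate captions with toy embeddings. Choosing from a
# fixed list replaces the generation step purely to keep this runnable.
candidates = {
    "a queen": [1.0, 0.0, 1.0],
    "a king": [1.0, 1.0, 0.0],
    "a man": [0.0, 1.0, 0.0],
}

# image(king) - text(man) + text(woman)
target = combine([
    (1.0, embeddings["image_king"]),
    (-1.0, embeddings["text_man"]),
    (1.0, embeddings["text_woman"]),
])

best = max(candidates, key=lambda caption: cosine(target, candidates[caption]))
print(best)
```

With these toy vectors the candidate closest to the combined embedding is "a queen". The normalization before summing is a design choice of this sketch so that no single input dominates because of its magnitude.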

Code & Models

Repositories

yoadtew/zero-shot-image-to-text
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning