Guiding Image Captioning Models Toward More Specific Captions

Simon Kornblith; Lala Li; Zirui Wang; Thao Nguyen

arXiv:2307.16686·cs.CV·August 1, 2023

TL;DR

This paper generates more specific image captions by applying classifier-free guidance during decoding. The guidance scale trades off $p(\text{caption} \mid \text{image})$ against $p(\text{image} \mid \text{caption})$, and a scale above 1 improves reference-free metrics at the cost of standard captioning metrics.

Contribution

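The paper adapts classifier-free guidance to autoregressive image captioning models, fine-tuning them to estimate both conditional and unconditional caption distributions so that caption specificity can be increased with minimal changes to training.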

Findings

01. Decoding with a guidance scale of 2 improves CLIPScore and retrieval recall over standard greedy decoding.

02. Decoding with guidance worsens standard captioning metrics.

03. Language model guidance offers small improvements in caption quality.

Abstract

Image captioning is conventionally formulated as the task of generating captions for images that match the distribution of reference image-caption pairs. However, reference captions in standard captioning datasets are short and may not uniquely identify the images they describe. These problems are further exacerbated when models are trained directly on image-alt text pairs collected from the internet. In this work, we show that it is possible to generate more specific captions with minimal changes to the training process. We implement classifier-free guidance for an autoregressive captioning model by fine-tuning it to estimate both conditional and unconditional distributions over captions. The guidance scale applied at decoding controls a trade-off between maximizing $p(\text{caption} \mid \text{image})$ and $p(\text{image} \mid \text{caption})$. Compared to standard greedy decoding,…
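To make the decoding step concrete, the following is a minimal sketch of classifier-free guidance for an autoregressive model, not the authors' implementation. It assumes the common formulation in which per-token log-probabilities are combined as uncond + gamma * (cond - uncond), so that gamma = 1 recovers ordinary conditional decoding and gamma > 1 upweights tokens that the image makes more likely. The names `guided_log_probs`, `greedy_guided_decode`, and `step_fn` are hypothetical; `step_fn(tokens, image)` stands in for a model call that returns next-token logits, with `image=None` meaning the unconditional branch.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))


def guided_log_probs(cond_logits, uncond_logits, gamma):
    """Combine conditional and unconditional next-token distributions.

    Assumed formulation: log p_guided = log p_uncond
    + gamma * (log p_cond - log p_uncond), renormalized over the vocabulary.
    """
    cond = log_softmax(np.asarray(cond_logits, dtype=float))
    uncond = log_softmax(np.asarray(uncond_logits, dtype=float))
    mixed = uncond + gamma * (cond - uncond)
    return log_softmax(mixed)


def greedy_guided_decode(step_fn, image, gamma, eos_id, max_len=20):
    """Greedy decoding where each step uses the guided distribution.

    step_fn is a hypothetical callable: step_fn(tokens, image) -> logits.
    Passing image=None requests the unconditional (image-free) prediction.
    """
    tokens = []
    for _ in range(max_len):
        cond_logits = step_fn(tokens, image)
        uncond_logits = step_fn(tokens, None)
        scores = guided_log_probs(cond_logits, uncond_logits, gamma)
        next_token = int(np.argmax(scores))
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens
```

Under this formulation, each step needs two forward passes (with and without the image), which is why the model must be able to estimate both distributions. Since the ratio of conditional to unconditional caption probability is proportional to $p(\text{image} \mid \text{caption})$ by Bayes' rule, raising gamma above 1 shifts weight toward captions that better identify the image.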

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training