CLID: Controlled-Length Image Descriptions with Limited Data

Elad Hirsch; Ayellet Tal

arXiv:2211.14835·cs.CV·January 23, 2024

TL;DR

This paper introduces CLID, a method for controlling image caption length by enriching datasets with self-generated captions and a novel training strategy, achieving state-of-the-art quality and length control in image and paragraph captioning.

Contribution

The paper presents a new training approach that enables effective length control in image captioning using limited data, applicable to both captions and paragraphs.

Findings

01 Significantly improves length control in image captioning.

02 Achieves state-of-the-art caption quality.

03 Extends to paragraph generation.

Abstract

Controllable image captioning models generate human-like image descriptions, enabling some kind of control over the generated captions. This paper focuses on controlling the caption length, i.e. a short and concise description or a long and detailed one. Since existing image captioning datasets contain mostly short captions, generating long captions is challenging. To address the shortage of long training examples, we propose to enrich the dataset with varying-length self-generated captions. These, however, might be of varying quality and are thus unsuitable for conventional training. We introduce a novel training strategy that selects the data points to be used at different times during the training. Our method dramatically improves the length-control abilities, while exhibiting SoTA performance in terms of caption quality. Our approach is general and is shown to be applicable also to…
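The abstract describes a training strategy that selects which data points are used at different times during training, but it does not give the selection rule. The following is a minimal sketch of one way such time-dependent selection could look, not the authors' method. The `Example` class, the `quality` score, the linear threshold schedule, and the function names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    caption: str
    self_generated: bool  # True if produced by a model rather than a human
    quality: float        # hypothetical quality score in [0, 1]


def quality_threshold(epoch: int, total_epochs: int,
                      start: float = 0.0, end: float = 0.8) -> float:
    """Hypothetical schedule: linearly move the threshold from start to end."""
    if total_epochs <= 1:
        return end
    frac = epoch / (total_epochs - 1)
    return start + frac * (end - start)


def select_for_epoch(data: List[Example], epoch: int,
                     total_epochs: int) -> List[Example]:
    """Keep every human-written caption; keep a self-generated caption only
    if its quality score reaches the threshold for this epoch (assumption)."""
    thr = quality_threshold(epoch, total_epochs)
    return [ex for ex in data if not ex.self_generated or ex.quality >= thr]


if __name__ == "__main__":
    data = [
        Example("a dog on a beach", False, 1.0),
        Example("a brown dog runs along a sandy beach near the waves", True, 0.9),
        Example("dog dog beach beach sand", True, 0.5),
    ]
    for epoch in (0, 4):
        kept = select_for_epoch(data, epoch, total_epochs=5)
        print(epoch, [ex.caption for ex in kept])
```

Under this assumed schedule, early epochs see all of the enriched data, while later epochs train only on human captions and the higher-scoring self-generated ones. Other schedules, including the reverse direction, fit the abstract's description equally well.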

Code & Models

Repositories

eladhi/clid (PyTorch, official)

Videos

CLID: Controlled-Length Image Descriptions With Limited Data (YouTube)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization