Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

Nicholas Moratelli; Davide Caffagni; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara

arXiv:2408.14547 · cs.CV · August 28, 2024

TL;DR

This paper introduces DiCO, a novel training paradigm for image captioning that directly optimizes CLIP-based metrics, resulting in more stable training and captions that better align with human preferences.

Contribution

The paper proposes DiCO, a training method that jointly learns and optimizes a reward model distilled from a learnable captioning evaluator, so that the captioner is optimized directly for CLIP-based metrics with more stable training and higher caption quality.

Findings

01 Enhanced stability in the training process.

02 Captions better aligned with human preferences.

03 Competitive performance maintained on traditional metrics.

Abstract

The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is…
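To make the idea in the abstract more concrete, here is a minimal sketch of what a weighted classification objective with a reference-model constraint could look like. It is not the paper's exact objective: the function names, the `beta` temperature, the softmax weighting over score gaps, and the winner/loser selection are all assumptions made for illustration, in the spirit of preference-optimization losses. The only part taken from prior work is the CLIP-Score definition, `w * max(cos(image, text), 0)` with `w = 2.5`.

```python
import math


def clip_score(image_emb, text_emb, w=2.5):
    """CLIP-Score: w * max(cosine(image, text), 0), with w = 2.5 by convention."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm_i = math.sqrt(sum(a * a for a in image_emb))
    norm_t = math.sqrt(sum(b * b for b in text_emb))
    return w * max(dot / (norm_i * norm_t), 0.0)


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_preference_loss(logp_policy, logp_ref, scores, beta=0.1):
    """Hypothetical weighted classification loss over candidate captions.

    logp_policy[i]: sequence log-probability of caption i under the captioner
                    being trained.
    logp_ref[i]:    the same quantity under a frozen copy of the original
                    captioner (assumed here to be what limits divergence).
    scores[i]:      evaluator score of caption i (e.g. a CLIP-based metric).
    """
    # The best-scored caption is treated as the "winner"; the rest are "losers".
    win = max(range(len(scores)), key=scores.__getitem__)
    losers = [i for i in range(len(scores)) if i != win]

    # Implicit reward: scaled log-ratio between trained and reference models.
    reward = [beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]

    # Assumed weighting: softmax over how much worse each loser scores.
    gaps = [scores[win] - scores[i] for i in losers]
    top = max(gaps)
    exps = [math.exp(g - top) for g in gaps]
    weights = [e / sum(exps) for e in exps]

    # Weighted binary classification: the winner should out-reward each loser.
    return -sum(
        wt * log_sigmoid(reward[win] - reward[i])
        for wt, i in zip(weights, losers)
    )


if __name__ == "__main__":
    scores = [0.9, 0.5, 0.2]
    ref = [-3.0, -3.0, -3.0]
    print(weighted_preference_loss(ref, ref, scores))
    print(weighted_preference_loss([-1.0, -5.0, -5.0], ref, scores))
```

In this sketch the loss decreases as the trained captioner assigns relatively more probability than the reference model to the best-scored caption, while the log-ratio term is one common way to discourage drifting far from the original model.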

Code & Models

Repositories

aimagelab/dico (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization