Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

Sara Sarto; Nicholas Moratelli; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara

arXiv:2410.07336·cs.CV·July 31, 2025

TL;DR

This paper introduces PAC-S++, a learnable, CLIP-based metric for better caption evaluation and fine-tuning, leading to more accurate, semantically rich captions with fewer errors and hallucinations.

Contribution

The paper presents PAC-S++, a novel, learnable evaluation metric leveraging enhanced CLIP pre-training and positive sample regularization for improved caption assessment and model fine-tuning.

Findings

1. PAC-S++ outperforms popular metrics in caption evaluation.

2. Integrating PAC-S++ in fine-tuning yields richer, more accurate captions.

3. The approach reduces hallucinations and grammatical errors in generated captions.

Abstract

Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for caption evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically…
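To make the two roles described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a CLIP-style score computed as the cosine similarity between an image embedding and a caption embedding, clipped at zero, and the standard SCST formulation in which the reward of a greedily decoded caption serves as the baseline for a sampled caption. The function names (`cosine`, `clip_style_score`, `scst_loss`), the `scale` parameter, and the toy embeddings are all hypothetical; in practice the embeddings would come from the fine-tuned CLIP encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_style_score(image_emb, text_emb, scale=1.0):
    """Reference-free score: scaled cosine similarity, clipped at zero.

    `scale` is a hypothetical rescaling factor, not a value from the paper.
    """
    return scale * max(cosine(image_emb, text_emb), 0.0)


def scst_loss(sample_logprobs, sample_reward, baseline_reward):
    """Self-critical loss for one sampled caption.

    The advantage is the sampled caption's reward minus the reward of the
    greedy caption (the baseline); the loss is the negative advantage times
    the summed token log-probabilities of the sampled caption.
    """
    advantage = sample_reward - baseline_reward
    return -advantage * sum(sample_logprobs)


# Toy embeddings (assumed values for illustration only).
image_emb = [1.0, 0.0]
sampled_caption_emb = [1.0, 0.0]   # aligned with the image
greedy_caption_emb = [0.0, 1.0]    # orthogonal to the image

r_sample = clip_style_score(image_emb, sampled_caption_emb)
r_greedy = clip_style_score(image_emb, greedy_caption_emb)
loss = scst_loss([-0.5, -0.25], r_sample, r_greedy)
```

In this toy case the sampled caption scores higher than the greedy one, so the advantage is positive and minimizing the loss raises the probability of the sampled caption. When the sampled caption scores lower than the baseline, the sign flips and its probability is pushed down.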

Code & Models

Repositories

aimagelab/pacscore (PyTorch, official)

Taxonomy

Topics: Robotics and Automated Systems · Online and Blended Learning

Methods: Contrastive Language-Image Pre-training