Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training
Sara Sarto, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TL;DR
This paper introduces PAC-S++, a learnable, CLIP-based metric for better caption evaluation and fine-tuning, leading to more accurate, semantically rich captions with fewer errors and hallucinations.
Contribution
The paper presents PAC-S++, a novel, learnable evaluation metric leveraging enhanced CLIP pre-training and positive sample regularization for improved caption assessment and model fine-tuning.
Findings
PAC-S++ outperforms popular metrics in caption evaluation.
Integrating PAC-S++ in fine-tuning yields richer, more accurate captions.
The approach reduces hallucinations and grammatical errors in generated captions.
Abstract
Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for caption evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically…
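To make the two roles described in the abstract concrete, here is a minimal sketch of how a CLIP-style similarity could serve both as a caption score and as an SCST reward. It is an illustration under stated assumptions, not the authors' implementation: the function names, the scaling factor `w`, the toy embeddings, and the choice of baseline are all hypothetical, and in a real setup the embeddings would come from the fine-tuned CLIP encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_style_score(image_emb, text_emb, w=2.0):
    """Reference-free score: scaled cosine similarity, clipped at zero.

    `w` is an assumed scaling factor, not a value taken from the abstract.
    """
    return w * max(cosine(image_emb, text_emb), 0.0)

def scst_loss(sample_logprob, sample_reward, baseline_reward):
    """Self-critical loss for one sampled caption.

    The sampled caption's log-probability is weighted by how much its
    reward exceeds a baseline reward (for example, the reward of a
    greedily decoded caption). The baseline choice here is an assumption.
    """
    advantage = sample_reward - baseline_reward
    return -advantage * sample_logprob

# Toy embeddings standing in for CLIP image and text features (hypothetical).
image = [1.0, 0.0]
sampled_caption = [1.0, 0.0]
greedy_caption = [1.0, 1.0]

r_sample = clip_style_score(image, sampled_caption)
r_greedy = clip_style_score(image, greedy_caption)
loss = scst_loss(sample_logprob=-1.2, sample_reward=r_sample, baseline_reward=r_greedy)
```

When the sampled caption scores above the baseline, the advantage is positive, so minimizing the loss raises that caption's log-probability; captions scoring below the baseline are pushed down.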
Taxonomy
Topics: Robotics and Automated Systems · Online and Blended Learning
Methods: Contrastive Language-Image Pre-training
