Evaluating Image Caption via Cycle-consistent Text-to-Image Generation
Tianyu Cui, Jinbin Bai, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Ye Shi

TL;DR
This paper introduces CAMScore, a novel reference-free image caption evaluation metric that uses cycle-consistent text-to-image generation to better align with human judgment, addressing modality gap issues in existing metrics.
Contribution
CAMScore is the first metric to leverage cycle-consistent text-to-image generation for reference-free caption evaluation, and it incorporates a three-level framework for comprehensive assessment.
Findings
CAMScore correlates better with human judgments than existing metrics.
The three-level evaluation provides detailed insights into caption quality.
Extensive experiments validate the effectiveness of CAMScore across datasets.
Abstract
Evaluating image captions typically relies on reference captions, which are costly to obtain and exhibit significant diversity and subjectivity. While reference-free evaluation metrics have been proposed, most focus on cross-modal evaluation between captions and images. Recent research has revealed that the modality gap generally exists in the representation of contrastive learning-based multi-modal systems, undermining the reliability of cross-modality metrics like CLIPScore. In this paper, we propose CAMScore, a cyclic reference-free automatic evaluation metric for image captioning models. To circumvent the aforementioned modality gap, CAMScore utilizes a text-to-image model to generate images from captions and subsequently evaluates these generated images against the original images. Furthermore, to provide fine-grained information for a more comprehensive evaluation, we design a…
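The cycle described in the abstract (caption, then generated image, then comparison with the original image) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `cycle_score`, `generate_image`, and `embed_image` are hypothetical, the text-to-image model and image encoder are passed in as plain callables, and cosine similarity over image embeddings is an assumed stand-in for whatever image-to-image comparison the paper actually uses. The three-level framework is not modeled because the abstract is truncated here.

```python
import math
from typing import Callable, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cycle_score(
    caption: str,
    original_image: Vector,
    generate_image: Callable[[str], Vector],  # hypothetical text-to-image model
    embed_image: Callable[[Vector], Vector],  # hypothetical image encoder
) -> float:
    """Score a caption by regenerating an image from it and comparing
    that image with the original, entirely in image space."""
    generated_image = generate_image(caption)
    return cosine_similarity(embed_image(original_image), embed_image(generated_image))


# Toy stand-ins so the sketch runs without any model:
# an "image" is just the counts of the letters a, b, c.
def toy_generate(caption: str) -> Vector:
    return [float(caption.count(ch)) for ch in "abc"]


def toy_embed(image: Vector) -> Vector:
    return list(image)


original = [2.0, 2.0, 1.0]
faithful = cycle_score("aabbc", original, toy_generate, toy_embed)
unfaithful = cycle_score("ccc", original, toy_generate, toy_embed)
```

With these toy stand-ins, the caption that regenerates the original image exactly scores higher than the one that does not. Because both sides of the comparison are images, no text embedding is ever compared against an image embedding, which is the point of routing the caption through a text-to-image model.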
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
