Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder
Chao Zeng, Tiesong Zhao, Sam Kwong

TL;DR
This paper introduces I^2CE, a novel learning-based image captioning evaluation metric that leverages contrastive and auto-encoder principles to better align with human judgments at the sentence level.
Contribution
It proposes a new contrastive learning-based metric with multiple model structures that improves correlation with human assessments for image captioning quality.
Findings
I^2CE with dual branches outperforms existing metrics in consistency with human judgments.
The method aligns well with scores from state-of-the-art captioning models.
I^2CE can serve as a complementary intrinsic indicator for caption evaluation.
Abstract
Automatically evaluating the quality of image captions is challenging because human language is flexible: the same meaning can be expressed in many ways. Most current captioning metrics rely on token-level matching between the candidate caption and the ground-truth reference sentences, and therefore usually neglect sentence-level information. Motivated by the auto-encoder mechanism and advances in contrastive representation learning, we propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I^2CE). We develop three progressive model structures to learn sentence-level representations: a single-branch model, a dual-branch model, and a triple-branch model. Our empirical tests show that I^2CE trained with the dual-branch structure achieves better consistency with human judgments than contemporary image captioning evaluation…
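The abstract does not spell out the training objective or the encoder, so the following is only a minimal sketch of the general idea behind contrastive sentence-level scoring, not the authors' implementation. It assumes captions have already been mapped to fixed-length embedding vectors (the toy vectors below are hypothetical), scores a candidate against a reference with cosine similarity, and uses an InfoNCE-style loss as a stand-in for whatever contrastive loss the paper uses. The names `cosine_similarity`, `info_nce_loss`, and `temperature` are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (assumed objective, not from the paper).

    Returns -log of the softmax probability assigned to the positive pair,
    where logits are temperature-scaled cosine similarities.
    """
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


# Hypothetical sentence embeddings; a real system would get these from a
# learned sentence encoder.
candidate = [0.9, 0.1, 0.0]                      # candidate caption
reference = [1.0, 0.0, 0.0]                      # matching reference caption
unrelated = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # captions of other images

score = cosine_similarity(candidate, reference)
loss = info_nce_loss(candidate, reference, unrelated)
```

In this sketch the similarity `score` plays the role of the sentence-level metric value, while `loss` is what training would minimize so that matching captions end up closer together than mismatched ones.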
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
