Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder

Chao Zeng; Tiesong Zhao; Sam Kwong

arXiv:2106.15312·cs.CV·June 30, 2021

Open Access

TL;DR

This paper introduces I^2CE, a novel learning-based image captioning evaluation metric that leverages contrastive and auto-encoder principles to better align with human judgments at the sentence level.

Contribution

It proposes a new contrastive learning-based metric with multiple model structures that improves correlation with human assessments for image captioning quality.

Findings

01

I^2CE with dual branches outperforms existing metrics in consistency with human judgments.

02

The method aligns well with scores from state-of-the-art captioning models.

03

I^2CE can serve as a complementary intrinsic indicator for caption evaluation.

Abstract

Automatically evaluating the quality of image captions is challenging because human language is flexible: the same meaning can be expressed in many ways. Most current captioning metrics rely on token-level matching between the candidate caption and the ground-truth label sentences, and so usually neglect sentence-level information. Motivated by the auto-encoder mechanism and advances in contrastive representation learning, we propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation ($I^2CE$). We develop three progressive model structures to learn sentence-level representations: a single-branch model, a dual-branch model, and a triple-branch model. Our empirical tests show that $I^2CE$ trained with the dual-branch structure achieves better consistency with human judgments than contemporary image captioning evaluation…
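
To make the idea of sentence-level contrastive scoring concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the encoder, the loss, or the branch architectures, so the InfoNCE-style loss, the `temperature` value, the function names, and the toy 3-dimensional embeddings below are all assumptions chosen for illustration. The sketch only shows the general principle that a candidate caption can be scored by the similarity of its sentence embedding to a reference embedding, and that a contrastive objective rewards embeddings that place matching sentences closer than non-matching ones.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumed, generic choice).

    The loss is small when the anchor is more similar to the positive
    than to any of the negatives.
    """
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the positive pair.
    return log_sum - logits[0]


# Hypothetical sentence embeddings; a real system would obtain these
# from a learned sentence encoder.
reference = [1.0, 0.0, 0.0]   # embedding of a ground-truth caption
candidate = [0.9, 0.1, 0.0]   # embedding of a paraphrase-like candidate
unrelated = [0.0, 0.0, 1.0]   # embedding of an unrelated caption

score_good = cosine_similarity(candidate, reference)
score_bad = cosine_similarity(unrelated, reference)

loss_matched = info_nce_loss(reference, candidate, [unrelated])
loss_mismatched = info_nce_loss(reference, unrelated, [candidate])
```

In this toy setup the paraphrase-like candidate receives a higher similarity score than the unrelated caption, and the loss is lower when the true match is treated as the positive pair. Unlike token-level matching, nothing in the score depends on the two captions sharing words.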

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning