WEmbSim: A Simple yet Effective Metric for Image Captioning
Naeha Sharif, Lyndon White, Mohammed Bennamoun, Wei Liu, Syed Afaq Ali Shah

TL;DR
WEmbSim is a simple cosine similarity-based metric using mean word embeddings that outperforms complex methods in unsupervised image caption evaluation, correlating well with human judgments.
Contribution
The paper introduces WEmbSim, a straightforward yet effective metric for image caption evaluation that surpasses complex existing metrics in correlation with human assessments.
Findings
WEmbSim outperforms SPICE, CIDEr, and WMD at system-level correlation.
It achieves the best accuracy at matching human consensus scores for caption pairs among commonly used unsupervised methods.
WEmbSim sets a new baseline that more complex caption evaluation metrics must beat to be justified.
Abstract
The area of automatic image caption evaluation is still undergoing intensive research to address the needs of generating captions which can meet adequacy and fluency requirements. Based on our past attempts at developing highly sophisticated learning-based metrics, we have discovered that a simple cosine similarity measure using the Mean of Word Embeddings (MOWE) of captions can actually achieve a surprisingly high performance on unsupervised caption evaluation. This inspires our proposed work on an effective metric WEmbSim, which beats complex measures such as SPICE, CIDEr and WMD at system-level correlation with human judgments. Moreover, it also achieves the best accuracy at matching human consensus scores for caption pairs, against commonly used unsupervised methods. Therefore, we believe that WEmbSim sets a new baseline for any complex metric to be justified.
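The core idea in the abstract, cosine similarity between the mean word embeddings of two captions, can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the toy 3-dimensional vectors, the whitespace tokenizer, the function names `mean_embedding` and `mowe_cosine`, and the choice to return 0.0 when a caption has no known words are all assumptions made here. The abstract does not say which pretrained embeddings, preprocessing, or multi-reference handling WEmbSim uses.

```python
import numpy as np

# Toy 3-dimensional embeddings, invented purely for illustration.
# A real setup would load pretrained word vectors instead (assumption).
TOY_EMBEDDINGS = {
    "a": np.array([0.1, 0.0, 0.1]),
    "dog": np.array([0.9, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.2, 0.1]),
    "runs": np.array([0.0, 0.9, 0.1]),
    "sprints": np.array([0.1, 0.8, 0.2]),
    "car": np.array([0.0, 0.1, 0.9]),
    "parked": np.array([0.1, 0.0, 0.8]),
}


def mean_embedding(caption, embeddings):
    """Average the vectors of all known words in a caption.

    Tokenization is a plain lowercase whitespace split (assumption).
    Returns None if no word in the caption has an embedding.
    """
    tokens = caption.lower().split()
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    return np.mean(vectors, axis=0)


def mowe_cosine(candidate, reference, embeddings):
    """Cosine similarity between the mean word embeddings of two captions.

    Returns 0.0 when either caption has no usable vector (assumption).
    """
    c = mean_embedding(candidate, embeddings)
    r = mean_embedding(reference, embeddings)
    if c is None or r is None:
        return 0.0
    denom = np.linalg.norm(c) * np.linalg.norm(r)
    if denom == 0.0:
        return 0.0
    return float(np.dot(c, r) / denom)


if __name__ == "__main__":
    reference = "a dog runs"
    print(mowe_cosine("a puppy sprints", reference, TOY_EMBEDDINGS))
    print(mowe_cosine("a car parked", reference, TOY_EMBEDDINGS))
```

With these toy vectors, the paraphrase "a puppy sprints" scores much closer to the reference than the unrelated "a car parked", even though it shares only one word with it. That is the behavior an embedding-based metric is meant to capture and that n-gram overlap alone would miss.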
