HICEScore: A Hierarchical Metric for Image Captioning Evaluation
Zequn Zeng, Jianqiao Sun, Hao Zhang, Tiansheng Wen, Yudi Su, Yan Xie, Zhengjue Wang, Bo Chen

TL;DR
HICE-S is a new hierarchical, reference-free image captioning evaluation metric that detects local visual and textual details, providing interpretable scores and outperforming existing metrics on multiple benchmarks.
Contribution
It introduces a hierarchical scoring mechanism that detects local regions and phrases, overcoming limitations of global CLIP-based metrics for more accurate and interpretable evaluation.
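To make the idea of region-level and phrase-level matching concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's actual scoring formula: it assumes that embeddings for image regions and caption phrases have already been extracted by some CLIP-like encoder, and the function names (`l2_normalize`, `local_scores`) and the max-over-matches rule are hypothetical choices made for this example.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    # Assumes no row is the zero vector.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def local_scores(region_embs, phrase_embs):
    """Hypothetical local matching between image regions and caption phrases.

    region_embs: (num_regions, dim) array of region embeddings (assumed precomputed)
    phrase_embs: (num_phrases, dim) array of phrase embeddings (assumed precomputed)
    Returns (phrase_support, region_coverage).
    """
    r = l2_normalize(np.asarray(region_embs, dtype=float))
    p = l2_normalize(np.asarray(phrase_embs, dtype=float))
    sim = p @ r.T  # shape (num_phrases, num_regions)
    # For each phrase: how well is it supported by its best-matching region?
    # A low value points at a possibly hallucinated phrase.
    phrase_support = sim.max(axis=1)
    # For each region: how well is it covered by its best-matching phrase?
    # A low value points at a region the caption may not describe.
    region_coverage = sim.max(axis=0)
    return phrase_support, region_coverage

# Toy example with hand-made 3-d "embeddings" (illustrative values only).
regions = [[1, 0, 0], [0, 1, 0]]
phrases = [[1, 0, 0], [0, 0, 1]]
support, coverage = local_scores(regions, phrases)
least_supported_phrase = int(np.argmin(support))
least_covered_region = int(np.argmin(coverage))
```

Per-phrase and per-region scores of this kind are what make an evaluation inspectable: the lowest entries indicate where a caption may be wrong and which part of the image it may have missed.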
Findings
Achieves state-of-the-art performance on several benchmarks.
Outperforms existing reference-free and reference-based metrics.
Provides an interpretable evaluation process similar to human judgment.
Abstract
Image captioning evaluation metrics can be divided into two categories, reference-based metrics and reference-free metrics. However, reference-based approaches may struggle to evaluate descriptive captions with abundant visual details produced by advanced multimodal large language models, due to their heavy reliance on limited human-annotated references. In contrast, previous reference-free metrics have been proven effective via CLIP cross-modality similarity. Nonetheless, CLIP-based metrics, constrained by their solution of global image-text compatibility, often have a deficiency in detecting local textual hallucinations and are insensitive to small visual objects. Besides, their single-scale designs are unable to provide an interpretable evaluation process such as pinpointing the position of caption mistakes and identifying visual regions that have not been described. To move forward,…
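For context, the global CLIP-based score that the abstract contrasts against can be sketched as follows. This follows the reference-free CLIP-S definition from the CLIPScore paper, w * max(cos(c, v), 0) with w = 2.5, and is not taken from this page. The function name `clip_s` is a hypothetical choice, and the image and caption embeddings are assumed to be precomputed, nonzero vectors.

```python
import numpy as np

def clip_s(image_emb, caption_emb, w=2.5):
    """Global reference-free score: w * max(cosine(image, caption), 0).

    image_emb, caption_emb: 1-d embeddings from a CLIP-like encoder
    (assumed precomputed and nonzero).
    """
    v = np.asarray(image_emb, dtype=float)
    c = np.asarray(caption_emb, dtype=float)
    cos = float(v @ c / (np.linalg.norm(v) * np.linalg.norm(c)))
    # Negative similarities are clipped to zero before rescaling.
    return w * max(cos, 0.0)

# Toy 2-d vectors (illustrative values only).
aligned = clip_s([1, 0], [1, 0])
orthogonal = clip_s([1, 0], [0, 1])
opposed = clip_s([1, 0], [-1, 0])
```

Because the whole image and the whole caption are each reduced to a single vector, the result is one number with no indication of which phrase or region caused a low score, which is the limitation the abstract describes.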
Taxonomy
Methods: Contrastive Language-Image Pre-training
