Are metrics measuring what they should? An evaluation of image captioning task metrics
Othón González-Chávez, Guillermo Ruiz, Daniela Moctezuma, Tania A. Ramirez-delReal

TL;DR
This paper evaluates the effectiveness of various image captioning metrics, including n-gram based and embedding-based methods, to determine how well they measure caption quality and semantic content.
Contribution
It provides a comprehensive comparison of current image captioning metrics using artificial and real caption data on the MS COCO dataset, highlighting their strengths and limitations.
Findings
Classical n-gram metrics are insufficient for capturing semantic content.
Embedding-based metrics like BERTScore and CLIPScore offer different insights.
Current metrics may not fully align with human judgment of caption quality.
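The first finding can be illustrated with a small sketch. It is not the paper's experimental setup: the three sentences are invented for illustration, and `clipped_precision` is a hypothetical helper that computes only the clipped n-gram precision used inside BLEU-style metrics (no brevity penalty, no averaging over n). The sketch shows that a faithful paraphrase can score low while a word-shuffled caption with a different meaning can score high.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference. Tokenization is plain whitespace
    splitting on lowercased text, which is a simplifying assumption.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Invented example captions (not taken from MS COCO).
reference = "a man rides a horse on the beach"
paraphrase = "someone is riding a pony along the shore"  # similar meaning
shuffled = "a beach rides a man on the horse"            # different meaning

for name, caption in [("paraphrase", paraphrase), ("shuffled", shuffled)]:
    p1 = clipped_precision(caption, reference, 1)
    p2 = clipped_precision(caption, reference, 2)
    print(f"{name}: unigram={p1:.2f} bigram={p2:.2f}")
```

Here the paraphrase shares only "a" and "the" with the reference, so its unigram precision is 0.25 and its bigram precision is 0, while the shuffled caption reaches a unigram precision of 1.0 despite describing a different scene. Embedding-based metrics are intended to reduce this dependence on surface overlap.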
Abstract
Image Captioning is a current research task that describes image content using the objects in a scene and their relationships. Two important research areas converge in this task: artificial vision and natural language processing. In Image Captioning, as in any computational intelligence task, performance metrics are crucial for knowing how well (or badly) a method performs. In recent years, it has been observed that classical metrics based on n-grams are insufficient to capture the semantics and the critical meaning needed to describe the content of an image. To measure how well current and more recent metrics are doing, in this article we present an evaluation of several kinds of Image Captioning metrics and a comparison between them using the well-known MS COCO dataset. The metrics were selected from the most used in prior works; they are those based…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
