BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Sara Sarto; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara

arXiv:2407.20341·cs.CV·July 31, 2024

TL;DR

BRIDGE is a new learnable, reference-free image captioning evaluation metric that effectively incorporates visual information to better align with human judgment, outperforming existing metrics across multiple datasets.

Contribution

Introduces BRIDGE, a novel multimodal, reference-free evaluation metric that maps visual features into dense vectors and integrates them into pseudo-captions during evaluation.

Findings

01. Achieves state-of-the-art results among reference-free metrics.
02. Effectively incorporates image information without reference captions.
03. Outperforms existing evaluation scores on multiple datasets.

Abstract

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several…
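
The abstract describes mapping visual features into dense vectors and placing them inside pseudo-captions that are built at evaluation time. The sketch below is one possible toy reading of that idea and is not the authors' implementation: the linear mapping, the use of `None` as a template slot, the mean pooling, and the cosine-similarity score are all assumptions made for illustration, and every function name is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_visual_features(features, weights):
    """Hypothetical mapping module: a plain linear projection (assumption)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def build_pseudo_caption(template, dense_vectors):
    """Fill each None slot of the template with the next dense visual vector."""
    vectors = iter(dense_vectors)
    return [next(vectors) if token is None else token for token in template]


def mean_pool(token_vectors):
    """Average token vectors into a single embedding (assumed pooling choice)."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def toy_score(candidate_embedding, template, visual_features, weights_per_slot):
    """Score a candidate caption against a pseudo-caption; no reference captions used."""
    dense = [map_visual_features(visual_features, w) for w in weights_per_slot]
    pseudo_caption = build_pseudo_caption(template, dense)
    return cosine(candidate_embedding, mean_pool(pseudo_caption))


# Toy inputs: one textual token embedding followed by one visual slot.
template = [[1.0, 0.0], None]
visual_features = [2.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]

aligned = toy_score([1.0, 0.0], template, visual_features, [identity])
unrelated = toy_score([0.0, 1.0], template, visual_features, [identity])
print(aligned, unrelated)
```

In this toy setup a candidate embedding that points the same way as the pseudo-caption scores higher than an orthogonal one. A learned metric would replace the fixed projection and pooling with trained components.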

Code & Models

Repositories

aimagelab/bridge-score (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization