BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TL;DR
BRIDGE is a new learnable, reference-free image captioning evaluation metric that effectively incorporates visual information to better align with human judgment, outperforming existing metrics across multiple datasets.
Contribution
Introduces BRIDGE, a novel multimodal, reference-free evaluation metric that maps visual features into dense vectors and integrates them into pseudo-captions during evaluation.
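The idea of mapping visual features into dense vectors and placing them inside a pseudo-caption can be illustrated with a toy sketch. Everything below is an assumption made for illustration: this summary does not say how the mapping module is built or what the pseudo-caption template looks like, so the linear projection, the prefix/suffix layout, and the names `map_visual_features` and `build_pseudo_caption` are hypothetical, and random arrays stand in for real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
D_VIS, D_TXT = 8, 4  # toy feature sizes, chosen arbitrarily


def map_visual_features(visual_feats, W):
    """Project visual features into the text-embedding space.

    Assumption: a single linear map stands in for the learned mapping module.
    (n, D_VIS) @ (D_VIS, D_TXT) -> (n, D_TXT)
    """
    return visual_feats @ W


def build_pseudo_caption(prefix_emb, dense_vecs, suffix_emb):
    """Splice the mapped visual vectors between template token embeddings.

    Assumption: the pseudo-caption is a sequence of token embeddings with a
    slot in the middle that the dense visual vectors fill.
    """
    return np.concatenate([prefix_emb, dense_vecs, suffix_emb], axis=0)


visual_feats = rng.normal(size=(3, D_VIS))  # stand-in for image encoder output
W = rng.normal(size=(D_VIS, D_TXT))         # stand-in for learned weights
prefix = rng.normal(size=(2, D_TXT))        # stand-in for template tokens
suffix = rng.normal(size=(1, D_TXT))

dense = map_visual_features(visual_feats, W)
pseudo_caption = build_pseudo_caption(prefix, dense, suffix)
print(pseudo_caption.shape)  # (6, 4): 2 prefix + 3 visual + 1 suffix positions
```

In a setup like this, the resulting sequence would be passed through a text encoder and compared with the candidate caption, which is how image content could reach the score without any reference captions.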
Findings
Achieves state-of-the-art results among reference-free metrics.
Effectively incorporates image information without reference captions.
Outperforms existing evaluation scores on multiple datasets.
Abstract
Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several…
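For context on the reference-free baseline the abstract mentions, CLIP-Score is commonly defined as a rescaled, clipped cosine similarity between the CLIP image embedding and the CLIP caption embedding, with a scaling weight of 2.5. The sketch below shows only that final scoring step and assumes the embeddings have already been computed; the toy vectors are made up and the function name `clip_score` is just illustrative.

```python
import numpy as np


def clip_score(image_emb, caption_emb, w=2.5):
    """Return w * max(cos(image_emb, caption_emb), 0).

    Assumption: both inputs are 1-D embeddings already produced by CLIP's
    image and text encoders; no model is loaded here.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    caption_emb = np.asarray(caption_emb, dtype=float)
    cos = image_emb @ caption_emb / (
        np.linalg.norm(image_emb) * np.linalg.norm(caption_emb)
    )
    return w * max(float(cos), 0.0)


# Toy 2-D vectors standing in for real embeddings.
aligned = clip_score([1.0, 0.0], [1.0, 0.0])      # identical direction
orthogonal = clip_score([1.0, 0.0], [0.0, 1.0])   # unrelated direction
opposite = clip_score([1.0, 0.0], [-1.0, 0.0])    # negative cosine is clipped
```

Because the score reduces the whole image and the whole caption to one global similarity, it is easy to see why fine-grained details and hallucinated objects can go unpenalized, which is the shortcoming the abstract attributes to it.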
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
