BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

Gonçalo Gomes; Bruno Martins; Chrysoula Zerva

arXiv:2605.21728 · cs.CV · May 22, 2026


TL;DR

BEiTScore is an efficient, reference-free metric for evaluating image captions. It uses a lightweight cross-encoder trained with adversarial data augmentation, and is reported to outperform existing methods while keeping computational cost low.

Contribution

The paper introduces a lightweight cross-encoder, initialized from a visual question-answering model checkpoint and trained with adversarial augmentation, to make caption evaluation both more accurate and more efficient.

Findings

01 Achieves state-of-the-art performance on captioning benchmarks.

02 Maintains computational efficiency suitable for large-scale use.

03 Demonstrates robustness across diverse evaluation scenarios.

Abstract

Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as "bags-of-words." We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning,…
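The "bags-of-words" limitation mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's method and does not involve CLIP or BEiTScore; it only assumes a hypothetical order-insensitive text representation (plain token counts) to show why such a representation cannot separate two captions that use the same words to describe different scenes.

```python
from collections import Counter
import math


def bag_of_words(caption: str) -> Counter:
    """Toy, order-insensitive representation: lowercase token counts only."""
    return Counter(caption.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


# Two captions with identical words but opposite meanings (illustrative only).
caption_a = "a dog chasing a cat"
caption_b = "a cat chasing a dog"

# Word order is discarded, so both captions map to the same count vector
# and their similarity is 1.0 up to floating-point rounding.
similarity = cosine(bag_of_words(caption_a), bag_of_words(caption_b))
print(f"similarity = {similarity:.3f}")
```

A metric that scores an image against such a representation would rate both captions identically, even though at most one of them can describe the image correctly. A cross-encoder that processes image and caption tokens jointly is, in principle, not restricted in this way.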
