Modeling Image-Caption Rating from Comparative Judgments

Kezia Minni; Qiang Zhang; Monoshiz Mahbub Khan; Zhe Yu

arXiv:2602.00381·cs.CV·March 26, 2026

Open Access

TL;DR

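This paper introduces a machine learning framework that learns from comparative judgments (which of two image-caption pairs is the better match) instead of direct ratings. The resulting model ranks image-caption pairs about as well as a regression model trained on direct ratings (Kendall's τc of 0.804 versus 0.812), and in human studies the comparative judgments were faster and more consistent to collect.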

Contribution

The paper proposes a framework that trains models on comparative judgments instead of direct ratings, with the aim of making image-caption evaluation more efficient and consistent.

Findings

01

The regression model achieved Kendall's τc = 0.812 on the VICR dataset.

02

The comparative learning model achieved Kendall's τc = 0.804, close to the regression model's 0.812.

03

Comparative judgments are faster and more consistent than direct ratings in human studies.
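
For readers unfamiliar with the metric reported in the findings, Kendall's τc (Stuart's tau-c) can be computed as in the minimal sketch below. The function name `kendall_tau_c` and the toy ratings are illustrative assumptions, not the paper's evaluation code or data.

```python
from itertools import combinations


def kendall_tau_c(x, y):
    """Stuart's tau-c between two equal-length sequences (pure Python, O(n^2)).

    tau_c = 2 * (C - D) / (n^2 * (m - 1) / m), where C and D count concordant
    and discordant pairs and m is the smaller number of distinct values
    found in x and y.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
        # s == 0 means a tie in x or y; ties count toward neither total
    m = min(len(set(x)), len(set(y)))
    if m < 2:
        return float("nan")  # undefined when one variable is constant
    return 2 * (concordant - discordant) / (n * n * (m - 1) / m)


# Hypothetical data: discrete human ratings vs. continuous model scores.
human_ratings = [1, 1, 2, 2, 3]
model_scores = [0.1, 0.3, 0.2, 0.8, 0.9]
print(kendall_tau_c(human_ratings, model_scores))
```

The tau-c variant is a common choice when one variable takes few distinct values (such as a discrete rating scale) and the other is continuous, because its normalization accounts for the unequal number of levels. If SciPy is available, `scipy.stats.kendalltau(x, y, variant="c")` should give the same statistic.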

Abstract

Image caption rating is becoming increasingly important because computer-generated captions are used extensively for descriptive annotation. However, rating the accuracy of captions in describing images is time-consuming and subjective in nature. In contrast, it is often easier for people to compare two image-caption pairs and judge which one is the better match. In this study, we propose a machine learning framework that models such comparative judgments instead of direct ratings. The model can then be applied to rank unseen image-caption pairs in the same way as a regression model trained on direct ratings. Inspired by a state-of-the-art regression approach, we extracted visual and text features using a pre-trained ViLBERT model and tweaked the learning parameters of the baseline model to improve the model performance. This new regression model (with Kendall's $\tau_{c} = 0.812$)…
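
The idea of learning from comparative judgments and then ranking unseen items can be illustrated with a small sketch. Everything here is an assumption made for illustration: the linear scorer, the pairwise logistic (Bradley-Terry-style) loss, the function names, and the two-dimensional toy features all stand in for the paper's actual architecture and its ViLBERT features, which the abstract does not specify in detail.

```python
import math


def score(w, x):
    """Linear scorer shared by both items of a comparison (an assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_pairwise(judgments, dim, lr=0.1, epochs=200):
    """Fit w by SGD on a pairwise logistic loss.

    judgments: list of (feat_a, feat_b, label); label is 1 if item a was
    judged the better match, 0 if item b was.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for a, b, label in judgments:
            p = sigmoid(score(w, a) - score(w, b))  # P(a preferred over b)
            g = p - label  # gradient of the loss w.r.t. the score difference
            for k in range(dim):
                w[k] -= lr * g * (a[k] - b[k])
    return w


def rank_items(w, items):
    """Sort item names from best to worst by learned score."""
    return sorted(items, key=lambda name: score(w, items[name]), reverse=True)


# Hypothetical features for three image-caption pairs.
items = {"a": [0.9, 0.1], "b": [0.5, 0.4], "c": [0.1, 0.8]}
# Hypothetical comparative judgments: a beats b, b beats c, c loses to a.
judgments = [
    (items["a"], items["b"], 1),
    (items["b"], items["c"], 1),
    (items["c"], items["a"], 0),
]
w = train_pairwise(judgments, dim=2)
print(rank_items(w, items))
```

Because a single scorer is applied to both sides of each comparison, the trained model yields one score per image-caption pair, so unseen pairs can be ranked in the same way as with a regression model trained on direct ratings.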

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques