Modeling Image-Caption Rating from Comparative Judgments
Kezia Minni, Qiang Zhang, Monoshiz Mahbub Khan, Zhe Yu

TL;DR
This paper introduces a machine learning framework that models comparative judgments for image-caption rating. The resulting model ranks image-caption pairs about as well as a regression model trained on direct ratings, while the comparative judgments themselves are faster and more consistent to collect, which reduces annotation cost.
Contribution
The paper proposes a framework that trains models on comparative judgments instead of direct ratings, improving the efficiency and consistency of annotation for image-caption evaluation.
Findings
The regression model achieved Kendall's τc = 0.812 on the VICR dataset.
The comparative learning model achieved Kendall's τc = 0.804, comparable to the regression model.
Comparative judgments are faster and more consistent than direct ratings in human studies.
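The findings above are reported as Kendall's τc (Stuart's tau-c), a rank correlation that corrects for ties when the two variables have different numbers of distinct values. As a minimal sketch of how such a score can be computed, the pure-Python function below counts concordant and discordant pairs directly. The function name `kendall_tau_c` and the toy ratings are illustrative assumptions, not data or code from the paper.

```python
from itertools import combinations

def kendall_tau_c(x, y):
    """Stuart's tau-c between two equal-length sequences (O(n^2) sketch)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    # m is the smaller number of distinct values in either variable.
    m = min(len(set(x)), len(set(y)))
    if n < 2 or m < 2:
        raise ValueError("need at least two items and two distinct values")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
        # s == 0 means a tie in x or y; ties count toward neither total.
    return 2 * (concordant - discordant) / (n * n * (m - 1) / m)

# Hypothetical toy data: discrete human ratings vs. continuous model scores.
human = [1, 1, 2, 2, 3, 3]
predicted = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
tau = kendall_tau_c(human, predicted)
print(tau)
```

In practice, `scipy.stats.kendalltau(x, y, variant="c")` computes the same statistic far more efficiently; the loop above is only meant to make the definition explicit.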
Abstract
Image caption rating is becoming increasingly important because computer-generated captions are used extensively for descriptive annotation. However, rating the accuracy of captions in describing images is time-consuming and subjective in nature. In contrast, it is often easier for people to compare two image-caption pairs and judge which pair is the better match. In this study, we propose a machine learning framework that models such comparative judgments instead of direct ratings. The model can then be applied to rank unseen image-caption pairs in the same way as a regression model trained on direct ratings. Inspired by a state-of-the-art regression approach, we extracted visual and text features using a pre-trained ViLBERT model and tweaked the learning parameters of the baseline model to improve the model performance. This new regression model (with Kendall's )…
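To illustrate the general idea of learning a scoring function from comparative judgments, here is a minimal sketch. It assumes a linear scorer over fixed feature vectors and a Bradley-Terry-style logistic loss on score differences; the abstract does not state the paper's actual architecture or loss, so both are stand-in choices. The toy 2-D features stand in for ViLBERT embeddings, and all names (`score`, `train_pairwise`, `items`) are hypothetical.

```python
import math
import random

def score(w, x):
    """Linear score for one image-caption feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise(pairs, dim, epochs=200, lr=0.1, seed=0):
    """Fit weights from comparisons.

    pairs: list of (x_a, x_b, label); label is 1 if pair a was judged
    the better match than pair b, otherwise 0.
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    pairs = list(pairs)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for xa, xb, label in pairs:
            diff = score(w, xa) - score(w, xb)
            p = 1.0 / (1.0 + math.exp(-diff))  # modeled P(a beats b)
            grad = p - label                   # d(log loss) / d(diff)
            for i in range(dim):
                w[i] -= lr * grad * (xa[i] - xb[i])
    return w

# Hypothetical toy features; the first dimension carries the "true" quality.
items = [[0.1, 0.5], [0.4, 0.4], [0.7, 0.6], [0.9, 0.5]]
pairs = [
    (items[i], items[j], 1 if items[i][0] > items[j][0] else 0)
    for i in range(len(items))
    for j in range(len(items))
    if i != j
]
w = train_pairwise(pairs, dim=2)

# Rank items from worst to best by learned score.
ranking = sorted(range(len(items)), key=lambda i: score(w, items[i]))
print(ranking)
```

Because the learned function assigns a scalar score to each individual pair, it can rank unseen image-caption pairs the same way a regression model trained on direct ratings would, even though it never sees an absolute rating during training.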
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
