ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

Zitong Xu; Huiyu Duan; Shengyao Qin; Guangyu Yang; Guangji Ma; Xiongkuo Min; Ke Gu; Guangtao Zhai; Patrick Le Callet

arXiv:2604.03765 · cs.CV · April 14, 2026

TL;DR

This paper introduces ICBench, a large-scale image captioning benchmark with both short and long captions, and ITIScore, an automated metric that aligns well with human judgments and generalizes to other datasets.

Contribution

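The paper presents a new large-scale benchmark dataset, ICBench, and an automated metric, ITIScore, for evaluating image captioning models. Together they address three limitations of existing benchmarks: limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations.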

Findings

01

ICBench covers 12 content categories with 40K captions from 10 MLLMs.

02

ITIScore correlates strongly with human subjective scores.

03

ITIScore demonstrates robust zero-shot performance on other datasets.
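
Agreement between an automated metric and human ratings is usually summarized with a correlation coefficient. The sketch below computes a Spearman rank correlation between metric scores and mean opinion scores in plain Python. It is an illustration only: the summary does not say which coefficient the authors report, and the function names and sample numbers here are hypothetical, not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for illustration only (not results from the paper).
metric_scores = [0.2, 0.5, 0.4, 0.9]
human_mos = [1.5, 3.0, 3.2, 4.8]
rho = spearman(metric_scores, human_mos)
print(round(rho, 3))
```

A value near 1 means the metric ranks captions in nearly the same order as the human raters, regardless of the scale either one uses.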

Abstract

Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which potentially introduces bias and limits the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, where short captions are assessed in terms of fluency, relevance, and conciseness,…
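
The 40K total follows from the other figures in the abstract. A quick check, assuming "2K" means exactly 2,000 images and that each model writes one short and one long caption per image (the abstract implies this but does not state it; the variable names are illustrative):

```python
# Figures taken from the abstract; exact values are assumptions as noted above.
num_images = 2_000                 # "2K images"
num_models = 10                    # "10 advanced MLLMs"
caption_types = ("short", "long")  # "both short and long captions"

total_captions = num_images * num_models * len(caption_types)
print(total_captions)  # 40000
```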
