DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

Nakamasa Inoue; Kanoko Goto; Masanari Oi; Martyna Gruszka; Mahiro Ukai; Takumi Hirose; Yusuke Sekikawa

arXiv:2512.14420 · cs.CV · January 6, 2026

TL;DR

DISCODE is a novel, test-time adaptive evaluation method for image captioning that improves robustness and alignment with human judgments across diverse domains without requiring finetuning.

Contribution

The paper introduces DISCODE, a finetuning-free, distribution-aware evaluation method built on the Adaptive Test-Time (ATT) loss and a Gaussian prior, and presents MCEval, a new multi-domain caption evaluation benchmark.

Findings

1. DISCODE outperforms existing metrics on MCEval and other benchmarks.
2. It achieves state-of-the-art correlation with human judgments.
3. The method is robust under domain-shift scenarios.
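
For readers unfamiliar with how "correlation with human judgments" is usually measured for caption metrics, the following is a minimal sketch. This summary does not say which coefficient the paper reports; Kendall's tau-b is assumed here only because it is a common choice for this kind of comparison. The function name `kendall_tau_b` and the sample scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(metric_scores, human_scores):
    """Kendall's tau-b between automatic metric scores and human ratings.

    Assumption: tau-b is used for illustration; the source summary does
    not name the correlation coefficient the authors report.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")

    concordant = discordant = 0
    tied_metric_only = tied_human_only = 0

    # Compare every pair of captions once.
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_scores), 2):
        dm, dh = m1 - m2, h1 - h2
        if dm == 0 and dh == 0:
            continue  # tied on both sides: excluded from both denominators
        if dm == 0:
            tied_metric_only += 1
        elif dh == 0:
            tied_human_only += 1
        elif dm * dh > 0:
            concordant += 1  # metric and humans order the pair the same way
        else:
            discordant += 1  # metric and humans disagree on the order

    # Pairs not tied in the metric, and pairs not tied in the human ratings.
    untied_metric = concordant + discordant + tied_human_only
    untied_human = concordant + discordant + tied_metric_only
    denom = sqrt(untied_metric * untied_human)
    return (concordant - discordant) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical scores for four captions (not data from the paper).
    metric = [0.10, 0.40, 0.35, 0.80]
    human = [1, 3, 2, 5]
    print(kendall_tau_b(metric, human))
```

A value near 1 means the metric ranks captions in nearly the same order as human raters, a value near 0 means no consistent relationship, and a negative value means the orderings tend to be reversed.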

Abstract

Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis