Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline
Brian Gordon, Yonatan Bitton, Andreea Marzoca, Yasumasa Onoe, Xiao Wang, Daniel Cohen-Or, Idan Szpektor

TL;DR
This paper introduces DOCCI-Critique, a benchmark, and VNLI-Critique, a model, for fine-grained evaluation of the factual accuracy of paragraph-length image captions generated by Vision-Language Models, along with a Critic-and-Revise pipeline for improving caption quality.
Contribution
It provides a new benchmark with sentence-level annotations, a robust factuality classification model, and a pipeline for automatic caption correction, advancing fine-grained evaluation of detailed image captions.
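To make the shape of a critic-and-revise loop concrete, here is a minimal Python sketch. It is not the authors' implementation: the `critic` and `reviser` callables, their signatures, and the toy stand-ins below are all hypothetical. It only assumes what the summary states, namely that each sentence is judged within its paragraph context, that a flagged sentence comes with a critique, and that the critique drives the correction.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    sentence: str
    is_factual: bool
    critique: Optional[str]  # explanation, present only for flagged sentences


# Hypothetical interfaces (assumptions, not the paper's API):
#   critic(image_id, paragraph, sentence) -> (is_factual, critique)
#   reviser(sentence, critique) -> corrected sentence
Critic = Callable[[str, str, str], Tuple[bool, Optional[str]]]
Reviser = Callable[[str, str], str]


def critic_and_revise(
    image_id: str, sentences: List[str], critic: Critic, reviser: Reviser
) -> Tuple[List[str], List[Verdict]]:
    """Judge every sentence in paragraph context and rewrite the flagged ones."""
    paragraph = " ".join(sentences)
    verdicts: List[Verdict] = []
    revised: List[str] = []
    for sentence in sentences:
        is_factual, critique = critic(image_id, paragraph, sentence)
        verdicts.append(Verdict(sentence, is_factual, critique))
        # Factual sentences pass through unchanged; others are revised.
        revised.append(sentence if is_factual else reviser(sentence, critique or ""))
    return revised, verdicts


# Toy stand-ins so the sketch runs without any model.
def toy_critic(image_id: str, paragraph: str, sentence: str):
    if "red" in sentence:
        return False, "The car is blue, not red."
    return True, None


def toy_reviser(sentence: str, critique: str) -> str:
    return sentence.replace("red", "blue")


revised, verdicts = critic_and_revise(
    "img-0",
    ["A car is parked on the street.", "The car is red."],
    toy_critic,
    toy_reviser,
)
print(" ".join(revised))
```

In a real system the two callables would wrap model calls; keeping them as plain function arguments makes the control flow testable in isolation.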
Findings
VNLI-Critique generalizes well across benchmarks
AutoRater aligns closely with human judgments (e.g., a Spearman correlation of 0.98)
Critic-and-Revise improves caption factuality by 46% on DetailCaps-4870
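For readers unfamiliar with the agreement metric quoted above, the following sketch computes Spearman's rank correlation between two score lists in pure Python. The score values are invented toy numbers for illustration only; they are not results from the paper, and the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Toy per-model factuality scores (made up, NOT from the paper).
human_scores = [0.90, 0.70, 0.80, 0.50]
auto_scores = [0.85, 0.75, 0.70, 0.40]
rho = spearman(human_scores, auto_scores)
print(round(rho, 3))  # 0.8
```

Because the metric only compares orderings, a value near 1 means the automatic rater ranks systems in almost the same order as humans do, regardless of the absolute score scale.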
Abstract
Large Vision-Language Models (VLMs) now generate highly detailed, paragraph-length image captions, yet evaluating their factual accuracy remains challenging. Current methods often miss fine-grained errors, being designed for shorter texts or lacking datasets with verified inaccuracies. We introduce DOCCI-Critique, a benchmark with 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) featuring over 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark and strong results in CHOCOLATE claim verification. (2) The…
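The benchmark layout described in the abstract (one caption per image and VLM, each caption split into sentences with a correctness label and a rationale for errors) can be pictured with a small record type. This is a sketch under assumptions: the class names, field names, and the `factual_fraction` aggregation are hypothetical and are not the released data schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SentenceAnnotation:
    text: str
    is_factual: bool
    rationale: Optional[str] = None  # explanation, expected only for errors


@dataclass
class CaptionRecord:
    image_id: str
    vlm_name: str
    sentences: List[SentenceAnnotation] = field(default_factory=list)

    def factual_fraction(self) -> float:
        """Share of sentences labeled factual (one possible caption-level score)."""
        if not self.sentences:
            raise ValueError("caption has no annotated sentences")
        return sum(s.is_factual for s in self.sentences) / len(self.sentences)


# Caption count stated in the abstract: 100 images x 14 VLMs.
NUM_IMAGES, NUM_VLMS = 100, 14
num_captions = NUM_IMAGES * NUM_VLMS

# Toy record with invented content, for illustration only.
record = CaptionRecord(
    image_id="img-0",
    vlm_name="toy-vlm",
    sentences=[
        SentenceAnnotation("A dog sits on a porch.", True),
        SentenceAnnotation("The porch is wooden.", True),
        SentenceAnnotation("A cat sleeps beside it.", False, "No cat is visible."),
        SentenceAnnotation("The sky is overcast.", True),
    ],
)
print(num_captions, record.factual_fraction())
```

Keeping the rationale on the sentence rather than on the caption mirrors the sentence-level granularity the abstract emphasizes.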
Peer Reviews
Decision · Submitted to ICLR 2026
The paper is clearly written and logically structured, with a well-motivated problem statement and a thorough discussion of limitations in prior research. The proposed benchmark, model, and pipeline are novel and supported by strong empirical evidence.
- Benchmark Design (DOCCI-Critique): The authors present DOCCI-Critique, a carefully curated benchmark consisting of 1,400 captions generated by diverse VLMs and 10,216 sentence-level human annotations. Unlike prior VLM factuality benchmarks that
While the paper is overall well-executed, several aspects could be strengthened to enhance completeness and reproducibility.
- Dataset Scale and Annotation Cost: As acknowledged by the authors, the limited sample size of DOCCI-Critique constrains its statistical robustness. The benchmark contains 10,216 sentence-level annotations, each requiring multi-rater validation and textual rationales. Given that generating these critiques demands human intervention per caption (as illustrated in Table 1
1. This paper is clear and well-organized overall.
2. I'd like to thank the authors for their labor-intensive, human-involved benchmark construction.
1. Several prior studies have proposed caption revision models to enhance image caption quality [1,2]. However, the paper does not clearly articulate how its contributions go beyond these existing approaches.
2. The proposed critic-and-revise pipeline appears to be highly similar to the method introduced in a recent study [3]. A detailed comparison with this work is necessary to clarify the novelty of the proposed approach.
[1] Zhou et al., "ANALYZING AND MITIGATING OBJECT HALLUCINATION IN LARG
- The work provides a richly annotated, sentence-level factuality dataset that fills a clear gap in evaluating long VLM-generated captions.
- The VNLI-Critique model offers a scalable, generalizable evaluator that performs well across multiple external benchmarks.
- The Critic-and-Revise framework delivers meaningful factuality improvements, demonstrating real downstream utility.
- Because both the critic model and the revise pipeline rely heavily on DOCCI-style annotations, there is a possibility that the system implicitly learns annotator-specific bias rather than true general fine-grained factuality.
- Sentence-level evaluation may miss cross-sentence logical dependencies or global coherence errors, potentially encouraging overly localized correction strategies that do not improve holistic caption quality.
- Lack of necessary case studies and error analysis to intuiti
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis
