CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

Zhijiang Tang; Linhua Wang; Jiaxin Qi; Weihao Jiang; Peng Hou; Anxiang Zeng; Jianqiang Huang

arXiv:2602.21655·cs.CV·March 31, 2026


TL;DR

This paper introduces CCCaption, a reinforcement learning framework that optimizes image captions for completeness and correctness, addressing limitations of human-annotated references.

Contribution

It proposes a dual-reward reinforcement learning approach that explicitly enhances caption completeness and correctness using visual queries and hallucination penalties.

Findings

01 Consistent improvements across standard benchmarks.

02 Effective disentanglement of visual facts using diverse LVLMs.

03 Guides caption models beyond human-annotation imitation.

Abstract

Image captioning remains a fundamental task for vision-language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate Complete and Correct Captions. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that…
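
To make the two aspects concrete, the following is a minimal sketch of how a completeness score and a correctness score might be combined into a single scalar reward. It is not the paper's implementation. The names `dual_reward`, `answers_query`, `is_supported`, and `weight` are hypothetical, the equal-weight linear combination is an assumption, and the keyword-matching judges in the usage lines are toy stand-ins for the LVLM-based components that the abstract mentions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class DualReward:
    completeness: float  # fraction of visual queries the caption covers
    correctness: float   # fraction of caption claims supported by the image
    total: float         # weighted combination of the two scores


def dual_reward(
    caption: str,
    queries: Sequence[str],
    claims: Sequence[str],
    answers_query: Callable[[str, str], bool],
    is_supported: Callable[[str], bool],
    weight: float = 0.5,
) -> DualReward:
    """Toy dual reward (hypothetical; the paper's exact formulation may differ)."""
    # Completeness: share of visual queries that the caption answers.
    covered = sum(1 for q in queries if answers_query(caption, q))
    completeness = covered / len(queries) if queries else 0.0

    # Correctness: share of claims that are supported, so every hallucinated
    # claim lowers the score. Assumption: a caption with no claims scores 0.0,
    # so that an empty caption is not rewarded.
    supported = sum(1 for c in claims if is_supported(c))
    correctness = supported / len(claims) if claims else 0.0

    total = weight * completeness + (1.0 - weight) * correctness
    return DualReward(completeness, correctness, total)


# Toy usage: keywords stand in for visual queries and caption claims.
image_facts = {"dog", "grass", "frisbee", "tree"}   # assumed image content
caption = "a dog on grass next to a red ball under a tree"
queries = ["dog", "frisbee", "grass", "sky"]        # assumed visual queries
claims = ["dog", "grass", "ball", "tree"]           # assumed claims in the caption

reward = dual_reward(
    caption,
    queries,
    claims,
    answers_query=lambda cap, q: q in cap.lower(),
    is_supported=lambda claim: claim in image_facts,
)
print(reward)
```

In this toy case the caption covers two of the four queries and three of its four claims are supported, so the two scores differ and the combined reward reflects both omissions and the one hallucinated object.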
