ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

Tianle Li; Xuyang Shen; Yan Ma; Rongxin Guo; Shaoxiang Chen; Jiacheng Chen; Haochen Wang; Hongyang Tang; Yucong Zhou; Yu Cheng

arXiv:2605.20278 · cs.LG · May 21, 2026

TL;DR

ClaimDiff-RL is a reward framework for reinforcement learning on long-form image captioning. It scores a caption by its claim-level visual differences from a reference caption, targeting both factual accuracy and coverage.

Contribution

The paper proposes a reference-conditioned claim-difference reward that gives caption RL fine-grained, verifiable feedback on caption quality.

Findings

01

Reduces hallucination by increasing factual coverage.

02

Balances the tradeoff between faithfulness and coverage.

03

Outperforms existing baselines on multiple fine-grained metrics.

Abstract

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for…
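The judging pipeline described in the abstract can be pictured with a small sketch. It only illustrates the data flow and is not the authors' implementation: the `ClaimDiff` record, the three verdict labels, the 1 to 3 severity scale, the weights, and the `1 / (1 + penalty)` aggregation are all hypothetical choices made here, since the abstract does not specify how per-difference statistics become a scalar reward. The judge's output is hard-coded in place of a real multimodal model call.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical verdict labels; the source does not name these categories.
HALLUCINATION = "hallucination"  # actor asserts something the image contradicts
OMISSION = "omission"            # actor misses a detail the image supports
ACTOR_OK = "actor_ok"            # the difference does not count against the actor


@dataclass(frozen=True)
class ClaimDiff:
    description: str  # the visual claim on which actor and reference differ
    verdict: str      # judge's verdict after checking the difference against the image
    error_type: str   # open-vocabulary label, e.g. "color" or "count"
    severity: int     # assumed scale: 1 (minor) to 3 (severe)


def error_type_counts(diffs):
    """Count how often each error type is charged to the actor caption."""
    return Counter(d.error_type for d in diffs if d.verdict != ACTOR_OK)


def claim_diff_reward(diffs, w_hallucination=1.0, w_omission=0.5, max_severity=3):
    """Map judged differences to a scalar in (0, 1]; the formula is an assumption."""
    weights = {HALLUCINATION: w_hallucination, OMISSION: w_omission}
    penalty = sum(
        weights.get(d.verdict, 0.0) * d.severity / max_severity for d in diffs
    )
    return 1.0 / (1.0 + penalty)


# Stand-in for the judge's output on one (image, actor caption, reference) triple.
diffs = [
    ClaimDiff("actor says the dog is black; the image shows a brown dog",
              HALLUCINATION, "color", 3),
    ClaimDiff("actor does not mention the red umbrella", OMISSION, "object", 2),
    ClaimDiff("actor correctly notes a third person the reference skipped",
              ACTOR_OK, "count", 1),
]

reward = claim_diff_reward(diffs)
counts = error_type_counts(diffs)
print(f"reward={reward:.3f}", dict(counts))
```

In this sketch the two weights expose the tradeoff the abstract highlights: raising `w_hallucination` pushes the actor toward cautious captions, while raising `w_omission` pushes it toward fuller coverage.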

Code & Models

Repositories

ltl3a87/ClaimDiff-RL (GitHub)
