TL;DR
ClaimDiff-RL is a reward framework for image-captioning RL that scores a caption by its claim-level differences from a reference caption, aiming to improve both factual accuracy and coverage.
Contribution
It proposes a reference-conditioned claim-difference reward that gives RL fine-grained, verifiable feedback on caption quality.
Findings
Reduces hallucination while increasing factual coverage.
Balances the tradeoff between faithfulness and coverage.
Outperforms existing baselines on multiple fine-grained metrics.
Abstract
Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for…
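The judge pipeline described in the abstract (enumerate differences, verify each against the image, label error type and severity) can be pictured with a small Python sketch. The abstract is truncated before it says how the per-difference statistics are used, so everything below is an illustrative assumption and not the paper's method: the `ClaimDifference` fields, the `SEVERITY_WEIGHT` values, and the `toy_reward` aggregation are all hypothetical names and choices.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical severity weights; the source does not give any values.
SEVERITY_WEIGHT: Dict[str, float] = {"minor": 0.25, "moderate": 0.5, "severe": 1.0}


@dataclass
class ClaimDifference:
    """One atomic visual claim on which the actor and reference captions differ."""
    claim: str        # the claim text
    source: str       # "actor" (only in the actor caption) or "reference"
    verified: bool    # judge verdict: is the claim supported by the image?
    error_type: str   # open-vocabulary label, e.g. "wrong color"
    severity: str     # key into SEVERITY_WEIGHT


def difference_stats(diffs: List[ClaimDifference]) -> Dict[str, float]:
    """Summarize differences into faithfulness and coverage penalties."""
    # An actor-only claim the image does not support counts as a hallucination.
    hallucination = sum(
        SEVERITY_WEIGHT[d.severity]
        for d in diffs
        if d.source == "actor" and not d.verified
    )
    # A reference-only claim the image does support counts as an omission.
    omission = sum(
        SEVERITY_WEIGHT[d.severity]
        for d in diffs
        if d.source == "reference" and d.verified
    )
    return {"hallucination_penalty": hallucination, "omission_penalty": omission}


def toy_reward(
    diffs: List[ClaimDifference],
    faith_weight: float = 1.0,
    cover_weight: float = 1.0,
) -> float:
    """Assumed aggregation: negative weighted sum of the two penalties."""
    stats = difference_stats(diffs)
    return -(
        faith_weight * stats["hallucination_penalty"]
        + cover_weight * stats["omission_penalty"]
    )


if __name__ == "__main__":
    example = [
        ClaimDifference("the dog is brown", "actor", False, "wrong color", "moderate"),
        ClaimDifference("a red ball lies on the grass", "reference", True, "omission", "minor"),
        ClaimDifference("a fence is in the background", "actor", True, "extra detail", "minor"),
    ]
    print(difference_stats(example))
    print(toy_reward(example))
```

In this toy version, separate `faith_weight` and `cover_weight` knobs make the faithfulness and coverage tradeoff explicit instead of hiding it in one holistic score; a correct extra detail from the actor is neither rewarded nor penalized, which is another simplifying assumption.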
