SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning

Lin Zhang; Xianfang Zeng; Kangcong Li; Gang Yu; Tao Chen

arXiv:2508.06125 · cs.CV · August 11, 2025

TL;DR

SC-Captioner is a reinforcement learning framework that trains image captioning models to correct their own captions. Its reward function is computed from scene-graph parses of the initial, self-corrected, and reference captions, and the paper also proposes a set of evaluation metrics refined from CAPTURE.

Contribution

The paper presents a new self-correcting image captioning framework with a specialized reward function and introduces RefinedCaps, a fine-grained caption dataset, enhancing caption accuracy and evaluation.

Findings

01 Outperforms existing captioning methods in various scenarios.

02 Improves caption quality using self-correction and scene-graph based rewards.

03 Provides a new dataset, RefinedCaps, for detailed caption evaluation.

Abstract

We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models. Our crucial technique lies in the design of the reward function to incentivize accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between sets of initial and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to calculate correctness bonuses for accurate refinements and mistake punishments for wrong additions and removals, thereby forming the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching problems.…
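The reward construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the captions have already been parsed into object, attribute, and relation sets, that elements are matched against the reference by exact equality, that a removal counts as accurate when the removed element is absent from the reference, and that bonuses and punishments carry equal unit weights. The names `ParsedCaption`, `set_reward`, and `correction_reward`, and the `bonus` and `penalty` parameters, are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Hashable


@dataclass(frozen=True)
class ParsedCaption:
    """Hypothetical container for the three element sets of one caption."""
    objects: FrozenSet[Hashable]
    attributes: FrozenSet[Hashable]
    relations: FrozenSet[Hashable]


def set_reward(initial, corrected, reference, bonus=1.0, penalty=1.0):
    """Reward for one element type (weights are illustrative assumptions)."""
    added = corrected - initial      # elements introduced by self-correction
    removed = initial - corrected    # elements dropped by self-correction

    good_additions = added & reference   # added and supported by the reference
    bad_additions = added - reference    # added but not in the reference
    good_removals = removed - reference  # dropped and indeed not in the reference
    bad_removals = removed & reference   # dropped although the reference has it

    return (bonus * (len(good_additions) + len(good_removals))
            - penalty * (len(bad_additions) + len(bad_removals)))


def correction_reward(initial, corrected, reference):
    """Sum the per-type rewards over objects, attributes, and relations."""
    return sum(
        set_reward(getattr(initial, name),
                   getattr(corrected, name),
                   getattr(reference, name))
        for name in ("objects", "attributes", "relations")
    )


# Toy example with hand-written sets standing in for scene-graph parser output.
initial = ParsedCaption(
    objects=frozenset({"dog", "frisbee", "cat"}),
    attributes=frozenset({("dog", "brown")}),
    relations=frozenset({("dog", "catching", "frisbee")}),
)
corrected = ParsedCaption(
    objects=frozenset({"dog", "frisbee", "grass"}),
    attributes=frozenset({("dog", "brown"), ("frisbee", "blue")}),
    relations=frozenset(),
)
reference = ParsedCaption(
    objects=frozenset({"dog", "frisbee", "grass"}),
    attributes=frozenset({("dog", "brown"), ("frisbee", "red")}),
    relations=frozenset({("dog", "catching", "frisbee")}),
)

total = correction_reward(initial, corrected, reference)
print(total)  # 0.0: +2 for objects, -1 for attributes, -1 for relations
```

In this toy case the correction adds "grass" and drops "cat" (both accurate, +2), adds a wrong attribute (-1), and drops a relation that the reference contains (-1). Elements left unchanged by the correction contribute nothing, which reflects the abstract's focus on the difference between the initial and self-corrected captions.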

Code & Models

Datasets

zl2048/SC-Captioner-data
dataset · 12 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques