SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning
Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, Tao Chen

TL;DR
SC-Captioner is a reinforcement learning framework for self-correcting image captioning. Its reward function uses scene-graph parsing to score the elements a correction adds or removes against the reference caption, and the paper also proposes a set of evaluation metrics refined from CAPTURE.
Contribution
The paper presents a new self-correcting image captioning framework with a specialized reward function and introduces RefinedCaps, a fine-grained caption dataset, enhancing caption accuracy and evaluation.
Findings
SC-Captioner outperforms existing captioning methods in various scenarios.
Self-correction combined with scene-graph based rewards improves caption quality.
The paper provides a new dataset, RefinedCaps, for detailed caption evaluation.
Abstract
We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models. Our crucial technique lies in the design of the reward function to incentivize accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between sets of initial and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to calculate correctness bonuses for accurate refinements and mistake punishments for wrong additions and removals, thereby forming the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching problems.…
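The reward described in the abstract can be illustrated with a minimal sketch. It assumes the captions have already been parsed into sets of elements (objects, attributes, or relations) and that matching against the reference is exact string equality. The function name `correction_reward`, the `bonus` and `penalty` weights, and the simple linear combination are hypothetical choices for illustration; the abstract does not specify the exact formula or matching scheme.

```python
from typing import Set


def correction_reward(
    initial: Set[str],
    corrected: Set[str],
    reference: Set[str],
    bonus: float = 1.0,    # assumed weight for each accurate refinement
    penalty: float = 1.0,  # assumed weight for each wrong addition or removal
) -> float:
    """Toy reward for one element type (e.g. objects), assuming exact matching."""
    # Set differences identify what the self-correction step changed.
    added = corrected - initial
    removed = initial - corrected

    # Accurate refinements: adding a reference element, or removing a
    # non-reference (likely hallucinated) element.
    good = len(added & reference) + len(removed - reference)

    # Mistakes: adding a non-reference element, or removing a reference element.
    bad = len(added - reference) + len(removed & reference)

    return bonus * good - penalty * bad


# Usage: the correction drops a hallucinated "cat" and adds a missing "grass".
reward = correction_reward(
    initial={"dog", "frisbee", "cat"},
    corrected={"dog", "frisbee", "grass"},
    reference={"dog", "frisbee", "grass", "park"},
)
print(reward)  # 2.0
```

In this sketch, elements left unchanged by the correction contribute nothing, so the reward reflects only the edits made during self-correction. A fuller version would presumably compute this separately for the object, attribute, and relation sets and combine the results.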
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
