Learning Compact Reward for Image Captioning
Nannan Li, Zhenzhong Chen

TL;DR
This paper introduces rAIRL, a refined adversarial inverse reinforcement learning method that disentangles the reward for each word in a sentence and refines the loss function for more stable training, with the aim of improving the diversity and quality of generated image captions.
Contribution
The paper proposes a refined adversarial IRL approach that addresses reward ambiguity and mode collapse, enhancing image captioning performance.
Findings
Effective disentanglement of word rewards improves caption quality.
A refined loss function stabilizes adversarial training, and a conditional term mitigates mode collapse, leading to more diverse descriptions.
Outperforms existing methods on MS COCO and Flickr30K datasets.
Abstract
Adversarial learning has advanced the generation of natural and diverse descriptions in image captioning. However, the reward learned by existing adversarial methods is vague and ill-defined due to the reward ambiguity problem. In this paper, we propose a refined Adversarial Inverse Reinforcement Learning (rAIRL) method that handles the reward ambiguity problem by disentangling the reward for each word in a sentence, and that achieves stable adversarial training by refining the loss function to shift the generator towards Nash equilibrium. In addition, we introduce a conditional term in the loss function to mitigate mode collapse and to increase the diversity of the generated descriptions. Our experiments on MS COCO and Flickr30K show that our method can learn a compact reward for image captioning.
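To make the idea of a per-word reward concrete, here is a minimal sketch of how a reward can be read off a discriminator in standard adversarial IRL (AIRL). It is not the refined loss or conditional term proposed in the paper, whose exact form the abstract does not give. It assumes the usual AIRL discriminator D = exp(f) / (exp(f) + pi), where f is a learned score for a word and pi is the generator's probability of that word; the function names and argument layout are hypothetical.

```python
import math


def airl_discriminator(f_value, log_pi):
    """Standard AIRL discriminator for one word (assumed form, not rAIRL's).

    D = exp(f) / (exp(f) + pi), which equals sigmoid(f - log pi).
    """
    return 1.0 / (1.0 + math.exp(-(f_value - log_pi)))


def word_rewards(f_values, log_pis):
    """Return one reward per word: log D - log(1 - D).

    With the discriminator above this simplifies algebraically to
    f - log pi, so each word gets its own reward instead of a single
    score for the whole sentence.
    """
    rewards = []
    for f_value, log_pi in zip(f_values, log_pis):
        d = airl_discriminator(f_value, log_pi)
        rewards.append(math.log(d) - math.log(1.0 - d))
    return rewards
```

Calling `word_rewards` with one score and one log-probability per generated word returns a list of the same length, which a policy-gradient update could then use as word-level rewards.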
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
