Deep Reinforcement Learning-based Image Captioning with Embedding Reward
Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, Li-Jia Li

TL;DR
This paper proposes a reinforcement learning framework for image captioning in which a policy network and a value network, guided by a visual-semantic embedding reward, work together to generate more accurate captions.
Contribution
The paper introduces a decision-making framework whose policy and value networks are trained with actor-critic reinforcement learning, using a new visual-semantic embedding reward.
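As a rough illustration of what an embedding-based reward could look like, the sketch below scores a caption by the cosine similarity between an image embedding and a caption embedding that are assumed to live in the same vector space. The function names `cosine_similarity` and `embedding_reward` and the toy vectors are hypothetical; the summary above does not specify how the embeddings are produced or how the reward is normalized.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat the similarity as 0.
        return 0.0
    return dot / (norm_u * norm_v)


def embedding_reward(image_embedding, caption_embedding):
    """Hypothetical reward: similarity of image and caption embeddings.

    Assumes both inputs were already mapped into a shared
    visual-semantic space by some pretrained encoders (not shown).
    """
    return cosine_similarity(image_embedding, caption_embedding)


if __name__ == "__main__":
    image = [0.2, 0.9, 0.1]           # toy image embedding
    good_caption = [0.25, 0.85, 0.05]  # toy embedding of a matching caption
    bad_caption = [0.9, 0.1, 0.3]      # toy embedding of an unrelated caption
    print(embedding_reward(image, good_caption))
    print(embedding_reward(image, bad_caption))
```

Under this assumption, a caption whose embedding points in nearly the same direction as the image embedding earns a reward close to 1, while an unrelated caption earns a lower one.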
Findings
Outperforms state-of-the-art methods on the Microsoft COCO dataset
Improves scores across different evaluation metrics on that dataset
Validates the effectiveness of the embedding-based reward in caption generation
Abstract
Image captioning is a challenging problem owing to the complexity in understanding the image content and diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance of this task. Most state-of-the-art approaches follow an encoder-decoder framework, which generates captions using a sequential recurrent prediction model. However, in this paper, we introduce a novel decision-making framework for image captioning. We utilize a "policy network" and a "value network" to collaboratively generate captions. The policy network serves as a local guidance by providing the confidence of predicting the next word according to the current state. Additionally, the value network serves as a global and lookahead guidance by evaluating all possible extensions of the current state. In essence, it adjusts the goal of predicting the…
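The interplay between the two networks can be sketched as follows: each candidate next word gets a local score from the policy (its log-probability) and a lookahead score from the value estimate of the state reached by appending that word, and the two are mixed before choosing. This is a minimal sketch, not the authors' implementation; the function `pick_next_word`, the mixing weight `lam`, and the toy numbers are assumptions made for illustration.

```python
import math


def pick_next_word(policy_probs, value_of, lam=0.5):
    """Pick the candidate word with the best combined score.

    policy_probs: dict mapping word -> probability under the policy
                  network (local guidance).
    value_of:     callable mapping word -> estimated value of the state
                  obtained by appending that word (lookahead guidance).
    lam:          hypothetical mixing weight in [0, 1]; 1.0 trusts only
                  the policy, 0.0 trusts only the value estimate.
    """
    # Words with zero probability cannot be scored with a logarithm.
    candidates = [w for w, p in policy_probs.items() if p > 0.0]
    if not candidates:
        raise ValueError("no candidate word has positive probability")

    def score(word):
        local = math.log(policy_probs[word])
        lookahead = value_of(word)
        return lam * local + (1.0 - lam) * lookahead

    return max(candidates, key=score)


if __name__ == "__main__":
    probs = {"dog": 0.6, "cat": 0.4}    # toy policy output
    values = {"dog": 0.1, "cat": 0.9}   # toy value estimates
    print(pick_next_word(probs, values.__getitem__, lam=1.0))
    print(pick_next_word(probs, values.__getitem__, lam=0.5))
```

In the toy example, the policy alone prefers "dog", but once the value estimate is given equal weight the choice switches to "cat", which is the kind of correction a lookahead signal is meant to provide.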
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
