Stack-Captioning: Coarse-to-Fine Learning for Image Captioning
Jiuxiang Gu, Jianfei Cai, Gang Wang, Tsuhan Chen

TL;DR
This paper introduces a coarse-to-fine multi-stage image captioning framework that improves description richness and training stability, achieving state-of-the-art results on MSCOCO.
Contribution
It proposes a multi-stage captioning model with intermediate supervision and reinforcement learning to address vanishing gradients and exposure bias.
Findings
Achieves state-of-the-art performance on MSCOCO
Effectively generates more detailed image descriptions
Addresses vanishing gradient and exposure bias issues
Abstract
Existing image captioning approaches typically train a one-stage sentence decoder, which struggles to generate rich, fine-grained descriptions. On the other hand, multi-stage image captioning models are hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders, each of which operates on the output of the previous stage, producing increasingly refined image descriptions. Our proposed learning approach addresses the difficulty of vanishing gradients during training by providing a learning objective function that enforces intermediate supervision. In particular, we optimize our model with a reinforcement learning approach which utilizes the output of each intermediate decoder's test-time inference algorithm as well as the output of its preceding decoder to normalize…
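The idea of intermediate supervision described in the abstract can be illustrated with a small sketch. The excerpt does not give the paper's exact objective, so everything here is an assumption made for illustration: the function names `cross_entropy` and `intermediate_supervision_loss` are hypothetical, a plain cross-entropy loss stands in for the paper's reinforcement learning objective, and all stages are weighted equally. The point is only that every decoder stage contributes its own loss term, so earlier stages receive a direct training signal instead of relying on gradients passed back from the final stage.

```python
import math

def cross_entropy(probs, target_ids):
    """Mean negative log-likelihood of the reference tokens.

    probs: one probability distribution (list of floats) per time step.
    target_ids: index of the reference token at each time step.
    """
    return -sum(math.log(p[t]) for p, t in zip(probs, target_ids)) / len(target_ids)

def intermediate_supervision_loss(stage_probs, target_ids):
    """Sum the loss of every decoder stage (equal weights are an assumption)."""
    per_stage = [cross_entropy(p, target_ids) for p in stage_probs]
    return sum(per_stage), per_stage

# Toy data: a 3-word vocabulary, a 2-token reference caption, two stages.
target = [0, 2]
coarse_stage = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]  # less confident predictions
fine_stage = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]    # refined predictions

total, per_stage = intermediate_supervision_loss([coarse_stage, fine_stage], target)
print(per_stage, total)
```

In this toy setup the refined stage has the lower loss, and the total is simply the sum over stages. A real model would compute these distributions with stacked decoders and could weight the stages differently.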
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
