Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

Jiuxiang Gu; Jianfei Cai; Gang Wang; Tsuhan Chen

arXiv:1709.03376 · cs.CV · March 15, 2018 · 59 cites

TL;DR

This paper introduces a coarse-to-fine multi-stage image captioning framework that improves description richness and training stability, achieving state-of-the-art results on MSCOCO.

Contribution

It proposes a multi-stage captioning model with intermediate supervision and reinforcement learning to address vanishing gradients and exposure bias.

Findings

01. Achieves state-of-the-art performance on MSCOCO

02. Effectively generates more detailed image descriptions

03. Addresses vanishing gradient and exposure bias issues

Abstract

Existing image captioning approaches typically train a one-stage sentence decoder, which struggles to generate rich, fine-grained descriptions. Multi-stage image captioning models, on the other hand, are hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders, each of which operates on the output of the previous stage, producing increasingly refined image descriptions. Our proposed learning approach addresses the difficulty of vanishing gradients during training by providing a learning objective function that enforces intermediate supervision. In particular, we optimize our model with a reinforcement learning approach that utilizes the output of each intermediate decoder's test-time inference algorithm, as well as the output of its preceding decoder, to normalize…
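The idea of intermediate supervision can be illustrated with a small sketch. The abstract does not give the exact objective, so everything here is an assumption for illustration: each stage is scored with a plain cross-entropy loss against the same reference caption, the stage losses are combined with equal weights, and the names `cross_entropy` and `stacked_loss` are hypothetical. The reinforcement learning part of the method is not modeled.

```python
import math


def cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the reference tokens.

    step_probs: one probability distribution (list of floats) per time step.
    target_ids: index of the reference token at each time step.
    """
    total = 0.0
    for probs, target in zip(step_probs, target_ids):
        total -= math.log(probs[target])
    return total / len(target_ids)


def stacked_loss(stage_outputs, target_ids, stage_weights=None):
    """Combine one loss per decoder stage (assumed form, not the paper's).

    Because every stage contributes its own term, each decoder gets a direct
    training signal instead of relying only on gradients that flow back
    from the final stage.
    """
    if stage_weights is None:
        stage_weights = [1.0] * len(stage_outputs)  # assumption: equal weights
    per_stage = [cross_entropy(probs, target_ids) for probs in stage_outputs]
    total = sum(w * loss for w, loss in zip(stage_weights, per_stage))
    return total, per_stage


# Toy data: a 3-token vocabulary, a 2-token caption, and 2 decoder stages.
coarse = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]  # first stage, less confident
fine = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]    # second stage, more confident
total, per_stage = stacked_loss([coarse, fine], target_ids=[0, 1])
print(total, per_stage)
```

In this toy setup the fine stage has the lower loss, but the coarse stage still contributes its own term to the total, which is the point of supervising intermediate outputs.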

Code & Models

Repositories

showkeyjar/chinese_im2text.pytorch
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques