Hidden State Guidance: Improving Image Captioning using An Image Conditioned Autoencoder

Jialin Wu; Raymond J. Mooney

arXiv:1910.14208·cs.CV·January 16, 2020

TL;DR

This paper introduces Hidden State Guidance (HSG), a training framework for image captioning in which the caption decoder's hidden states are trained to match those of a teacher autoencoder conditioned on the image, leading to more accurate captions.

Contribution

HSG aligns the caption decoder's hidden states with those of a teacher decoder trained to autoencode the captions conditioned on the image, improving caption quality over several state-of-the-art caption decoders.
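As a rough illustration of what aligning hidden states can mean in practice, the sketch below computes a per-step mean squared distance between a student decoder's hidden states and a teacher's. The function name `hidden_state_matching_loss` and the choice of squared distance are assumptions made for illustration; this page does not state which distance the paper uses.

```python
from typing import List, Sequence

Vector = Sequence[float]


def hidden_state_matching_loss(
    student_states: List[Vector], teacher_states: List[Vector]
) -> float:
    """Average, over time steps, of the mean squared difference between
    the student and teacher hidden state at that step.

    Hypothetical helper: squared distance is an assumed choice.
    """
    if len(student_states) != len(teacher_states):
        raise ValueError("need one teacher state per student state")
    if not student_states:
        return 0.0

    total = 0.0
    for student, teacher in zip(student_states, teacher_states):
        if len(student) != len(teacher):
            raise ValueError("hidden states must have the same dimension")
        # Mean squared difference across the hidden dimensions of one step.
        total += sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)
    return total / len(student_states)


if __name__ == "__main__":
    student = [[0.0, 0.0], [1.0, 1.0]]
    teacher = [[0.0, 2.0], [1.0, 1.0]]
    print(hidden_state_matching_loss(student, teacher))
```

In a real training loop the teacher states would typically be treated as fixed targets, so only the student decoder is updated by this term.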

Findings

01. HSG outperforms several state-of-the-art caption decoders.

02. A word-level reward helps the model learn better hidden representations.

03. The method is effective with either raw images or detected objects as input.

Abstract

Most RNN-based image captioning models receive supervision on the output words to mimic human captions. Therefore, the hidden states can only receive noisy gradient signals via layers of back-propagation through time, leading to less accurate generated captions. Consequently, we propose a novel framework, Hidden State Guidance (HSG), that matches the hidden states in the caption decoder to those in a teacher decoder trained on an easier task of autoencoding the captions conditioned on the image. During training with the REINFORCE algorithm, the conventional rewards are sentence-based evaluation metrics equally distributed to each generated word, no matter their relevance. HSG provides a word-level reward that helps the model learn better hidden representations. Experimental results demonstrate that HSG clearly outperforms various state-of-the-art caption decoders using either raw images…
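To make the contrast between sentence-level and word-level rewards concrete, here is a minimal sketch of a REINFORCE-style loss under both schemes. The REINFORCE form (negative reward-weighted log-probability) is standard, but `word_level_rewards`, its `weight` parameter, and the idea of subtracting a per-step hidden-state distance from the sentence reward are hypothetical stand-ins: the abstract does not say how the word-level reward is computed.

```python
from typing import List


def reinforce_loss(
    log_probs: List[float], rewards: List[float], baseline: float = 0.0
) -> float:
    """REINFORCE surrogate loss: -sum_t (r_t - baseline) * log p(w_t)."""
    if len(log_probs) != len(rewards):
        raise ValueError("need one reward per sampled word")
    return -sum((r - baseline) * lp for lp, r in zip(log_probs, rewards))


def uniform_rewards(sentence_reward: float, num_words: int) -> List[float]:
    """Conventional scheme: every word receives the same sentence-level score."""
    return [sentence_reward] * num_words


def word_level_rewards(
    sentence_reward: float, state_distances: List[float], weight: float = 1.0
) -> List[float]:
    """Assumed scheme for illustration only: penalise each word by how far
    the decoder's hidden state was from the teacher's at that step."""
    return [sentence_reward - weight * d for d in state_distances]


if __name__ == "__main__":
    log_probs = [-0.5, -1.0, -2.0]   # log-probabilities of three sampled words
    distances = [0.0, 0.5, 1.0]      # hypothetical per-step state distances

    print(reinforce_loss(log_probs, uniform_rewards(1.0, 3)))
    print(reinforce_loss(log_probs, word_level_rewards(1.0, distances)))
```

With uniform rewards every word is reinforced equally regardless of its relevance; with per-word rewards, words generated from hidden states far from the teacher's contribute less to the update.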

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: REINFORCE