Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects
Ting Yao, Yingwei Pan, Yehao Li, Tao Mei

TL;DR
This paper introduces LSTM-C, a novel image captioning architecture that incorporates a copying mechanism, enabling the model to describe novel objects outside the training data by leveraging object recognition datasets.
Contribution
The paper presents a new LSTM-C architecture that integrates copying mechanisms into image captioning models to improve description of unseen objects.
Findings
LSTM-C effectively describes novel objects in captions.
LSTM-C outperforms state-of-the-art models on MSCOCO and ImageNet.
The copying mechanism enhances caption diversity and accuracy.
Abstract
Image captioning often requires a large set of training image-sentence pairs. In practice, however, acquiring sufficient training pairs is always expensive, making the recent captioning models limited in their ability to describe objects outside of training corpora (i.e., novel objects). In this paper, we present Long Short-Term Memory with Copying Mechanism (LSTM-C), a new architecture that incorporates copying into the Convolutional Neural Networks (CNN) plus Recurrent Neural Networks (RNN) image captioning framework, for describing novel objects in captions. Specifically, freely available object recognition datasets are leveraged to develop classifiers for novel objects. Our LSTM-C then integrates the standard word-by-word sentence generation by a decoder RNN with a copying mechanism, which may instead select words from novel objects at proper places in the output sentence.…
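The interplay between word-by-word generation and copying described in the abstract can be illustrated with a small sketch. This is not the paper's exact formulation: it assumes a single softmax shared by the decoder's vocabulary scores and the scores of detected objects, and the function name, the `copy_weight` parameter, and the example scores are all hypothetical.

```python
import math


def copy_augmented_distribution(gen_scores, copy_scores, copy_weight=1.0):
    """Merge generation and copy scores into one word distribution.

    gen_scores:  dict mapping caption-vocabulary words to decoder scores.
    copy_scores: dict mapping detected object words (possibly unseen in
                 captions) to scores derived from object classifiers.
    copy_weight: hypothetical knob trading off copying against generating.
    """
    all_scores = list(gen_scores.values()) + list(copy_scores.values())
    if not all_scores:
        raise ValueError("at least one score is required")
    shift = max(all_scores)  # subtract the max for numerical stability

    unnormalized = {}
    for word in set(gen_scores) | set(copy_scores):
        mass = 0.0
        if word in gen_scores:
            mass += math.exp(gen_scores[word] - shift)
        if word in copy_scores:
            mass += copy_weight * math.exp(copy_scores[word] - shift)
        unnormalized[word] = mass

    total = sum(unnormalized.values())
    return {word: mass / total for word, mass in unnormalized.items()}


# Illustrative decoding step: the decoder has never seen "zebra" in a
# caption, but an object classifier detects it confidently in the image.
gen_scores = {"a": 2.0, "dog": 1.0, "animal": 1.5}
copy_scores = {"zebra": 3.0}
dist = copy_augmented_distribution(gen_scores, copy_scores)
best_word = max(dist, key=dist.get)
```

Under these made-up scores the copied word receives the largest probability, so the decoder would emit the novel object at this position even though it never appeared in the caption vocabulary. With low copy scores, ordinary generation dominates instead.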
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
