TL;DR
This paper introduces a zero-shot image captioning method that effectively describes novel objects without additional annotations by decoupling language generation from object recognition.
Contribution
It proposes the Decoupled Novel Object Captioner (DNOC) framework, which separates language modeling from object descriptions using placeholders and a key-value object memory.
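The placeholder-and-memory idea can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a hypothetical `<PL>` placeholder token, a memory of (key vector, word) pairs such as an external object detector might supply, and a plain dot-product lookup. All names and the toy vectors are illustrative.

```python
from typing import List, Sequence, Tuple

PLACEHOLDER = "<PL>"  # hypothetical placeholder token name (assumption)

# One memory entry: (key feature vector, value word), e.g. from a detector.
MemoryEntry = Tuple[Sequence[float], str]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product used as the similarity score (an assumed choice)."""
    return sum(x * y for x, y in zip(a, b))


def fill_placeholders(
    template: List[str],
    queries: List[Sequence[float]],
    memory: List[MemoryEntry],
) -> List[str]:
    """Replace each placeholder with the word whose key best matches its query."""
    query_iter = iter(queries)
    words = []
    for token in template:
        if token == PLACEHOLDER:
            query = next(query_iter)
            # Pick the memory entry with the highest key-query similarity.
            _, value = max(memory, key=lambda entry: dot(query, entry[0]))
            words.append(value)
        else:
            words.append(token)
    return words


# Toy example: the language side emits a template with placeholders;
# the object side supplies words the language model never saw in training.
template = ["a", PLACEHOLDER, "standing", "next", "to", "a", PLACEHOLDER]
memory = [([1.0, 0.0], "zebra"), ([0.0, 1.0], "bus")]
queries = [[0.9, 0.1], [0.2, 0.8]]
caption = fill_placeholders(template, queries, memory)
print(" ".join(caption))
```

In this sketch the sentence template never contains a novel word itself, so the language side needs no captions mentioning "zebra" or "bus"; only the memory contents change when new object categories appear.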
Findings
Successfully describes novel objects in zero-shot settings
Outperforms baseline methods on the MSCOCO dataset
Demonstrates effective decoupling of language and object recognition
Abstract
Image captioning is a challenging task in which a machine automatically describes an image with sentences or phrases. Training typically requires a large number of paired image-sentence annotations. However, a pre-trained captioning model can hardly be applied to a new domain that contains novel object categories, i.e., objects and description words that were unseen during model training. Correctly captioning a novel object requires professional human workers to annotate images with sentences containing the novel words. This is labor-intensive and limits use in real-world applications. In this paper, we introduce the zero-shot novel object captioning task, in which the machine generates descriptions without extra sentences about the novel object. To tackle this challenging problem, we propose a Decoupled Novel Object Captioner (DNOC) framework that can fully decouple…
