TL;DR
This paper introduces the Novel Object Captioner (NOC), a captioning model that uses external object recognition datasets and semantic knowledge from unannotated text to describe many object categories that never appear in existing image-caption datasets.
Contribution
The paper presents NOC, a deep visual semantic captioning model that generalizes to novel objects by integrating external recognition data and semantic embeddings.
Findings
NOC can generate captions for hundreds of ImageNet object categories that are not observed in MSCOCO image-caption data.
The model outperforms prior work in describing diverse object categories.
Automatic and human evaluations confirm improved captioning diversity.
Abstract
Recent captioning models are limited in their ability to scale and describe concepts unseen in paired image-text corpora. We propose the Novel Object Captioner (NOC), a deep visual semantic captioning model that can describe a large number of object categories not present in existing image-caption datasets. Our model takes advantage of external sources -- labeled images from object recognition datasets, and semantic knowledge extracted from unannotated text. We propose minimizing a joint objective which can learn from these diverse data sources and leverage distributional semantic embeddings, enabling the model to generalize and describe novel objects outside of image-caption datasets. We demonstrate that our model exploits semantic information to generate captions for hundreds of object categories in the ImageNet object recognition dataset that are not observed in MSCOCO image-caption…
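The joint objective mentioned in the abstract can be illustrated with a small sketch. It assumes the objective is an unweighted sum of three cross-entropy terms (image labeling, language modeling on text, and captioning on paired data); the abstract excerpt does not state the exact form, so the decomposition, the equal weighting, and every function and parameter name below are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target_index):
    """Negative log-probability assigned to the correct class or word."""
    return -math.log(softmax(scores)[target_index])


def joint_loss(image_scores, image_label,
               text_scores, next_word,
               caption_scores, caption_word):
    """Hypothetical joint objective: sum of three per-source losses.

    image_scores   -- scores over visual concepts for a labeled image
    text_scores    -- scores over the vocabulary for unannotated text
    caption_scores -- scores over the vocabulary for a paired image-caption example
    Equal weighting of the three terms is an assumption of this sketch.
    """
    image_loss = cross_entropy(image_scores, image_label)
    text_loss = cross_entropy(text_scores, next_word)
    caption_loss = cross_entropy(caption_scores, caption_word)
    return image_loss + text_loss + caption_loss
```

In a sketch like this, a category that appears only in the recognition data still contributes gradient through the image term, and a word that appears only in unannotated text contributes through the text term. That is one plausible reading of how such a model could learn about objects that have no paired captions.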
Videos
Captioning Images With Diverse Objects (YouTube)
