TL;DR
This paper introduces an unsupervised method for image captioning that enhances diversity and specificity of generated captions by leveraging an image retrieval model, surpassing previous models in diversity and novelty.
Contribution
The authors propose a novel unsupervised training approach that improves caption diversity and specificity by integrating signals from an image retrieval model.
Findings
Achieved state-of-the-art results in caption diversity and novelty.
Generated captions are more specific and varied compared to previous models.
Source code is publicly available for reproducibility.
Abstract
Image Captioning is a task that requires models to acquire a multi-modal understanding of the world and to express this understanding in natural language text. While the state-of-the-art for this task has rapidly improved in terms of n-gram metrics, these models tend to output the same generic captions for similar images. In this work, we address this limitation and train a model that generates more diverse and specific captions through an unsupervised training approach that incorporates a learning signal from an Image Retrieval model. We summarize previous results and improve the state-of-the-art on caption diversity and novelty. We make our source code publicly available online.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
