Exploring Nearest Neighbor Approaches for Image Captioning
Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, C. Lawrence Zitnick

TL;DR
This paper investigates nearest neighbor methods for image captioning, finding they perform comparably to recent models on automatic metrics but are less preferred by humans, highlighting the gap between automated and human evaluation.
Contribution
The study systematically evaluates nearest neighbor baselines for image captioning and compares their performance to state-of-the-art generative models.
Findings
Nearest neighbor approaches perform similarly to recent models on automatic metrics.
Human evaluations favor models that generate novel captions over nearest neighbor methods.
Nearest neighbor methods are competitive baselines for image captioning tasks.
Abstract
We explore a variety of nearest neighbor baseline approaches for image captioning. These approaches find a set of nearest neighbor images in the training set from which a caption may be borrowed for the query image. We select a caption for the query image by finding the caption that best represents the "consensus" of the set of candidate captions gathered from the nearest neighbor images. When measured by automatic evaluation metrics on the MS COCO caption evaluation server, these approaches perform as well as many recent approaches that generate novel captions. However, human studies show that a method that generates novel captions is still preferred over the nearest neighbor approach.
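The retrieve-then-select procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which image features or caption similarity measure are used, so the cosine similarity over toy feature vectors, the word-overlap (Jaccard) caption similarity, the mean-similarity consensus score, and all names (`cosine`, `jaccard`, `consensus_caption`, `k`) are assumptions made here for illustration only.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(c1: str, c2: str) -> float:
    """Word-overlap similarity; an assumed stand-in for a real caption metric."""
    s1, s2 = set(c1.lower().split()), set(c2.lower().split())
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def consensus_caption(
    query_feat: Sequence[float],
    train_feats: Dict[str, Sequence[float]],
    train_captions: Dict[str, List[str]],
    k: int = 2,
) -> str:
    """Borrow the training caption that best agrees with the other candidates."""
    # 1. Rank training images by similarity to the query and keep the k nearest.
    ranked = sorted(
        train_feats,
        key=lambda img: cosine(query_feat, train_feats[img]),
        reverse=True,
    )
    neighbors = ranked[:k]

    # 2. Pool every caption attached to those neighbor images.
    candidates = [cap for img in neighbors for cap in train_captions[img]]
    if not candidates:
        raise ValueError("no candidate captions found for the nearest neighbors")

    # 3. Score each candidate by its mean similarity to all other candidates
    #    (an assumed definition of "consensus") and return the top scorer.
    def score(idx: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != idx]
        if not others:
            return 0.0
        return sum(jaccard(candidates[idx], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=score)
    return candidates[best]


# Toy data: hypothetical 2-D features and captions, not taken from MS COCO.
train_feats = {"img_a": [1.0, 0.0], "img_b": [0.9, 0.1], "img_c": [0.0, 1.0]}
train_captions = {
    "img_a": ["a dog runs on grass", "a dog plays on the grass"],
    "img_b": ["a dog runs on the grass"],
    "img_c": ["a plate of food on a table"],
}
result = consensus_caption([1.0, 0.05], train_feats, train_captions, k=2)
print(result)  # a dog runs on the grass
```

In this toy run the two nearest images are `img_a` and `img_b`, and the returned caption is the one sharing the most words on average with the other two candidates. Because the output is always copied from the training set, a method of this kind cannot produce a novel caption, which is the contrast the abstract draws with generative approaches.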
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
