Technical Report: Image Captioning with Semantically Similar Images
Martin Kolář, Michal Hradiš, Pavel Zemčík

TL;DR
This paper introduces an image captioning method that retrieves semantically similar images and reuses their most typical caption. Despite low scores on automated metrics, it is competitive in the ratio of captions that pass the Turing test.
Contribution
The approach uses CNN activations as an embedding to retrieve semantically similar images, then selects the most typical of their captions based on unigram frequencies.
Findings
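The method is competitive in the ratio of captions that pass the Turing test.
It is also competitive in the ratio of captions assessed as better than or equal to human captions.
It scores low on automated evaluation metrics and on human-assessed average correctness.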
Abstract
This report presents our submission to the MS COCO Captioning Challenge 2015. The method uses Convolutional Neural Network activations as an embedding to find semantically similar images. From these images, the most typical caption is selected based on unigram frequencies. Although the method received low scores on automated evaluation metrics and in human-assessed average correctness, it is competitive in the ratio of captions that pass the Turing test and that are assessed as better than or equal to human captions.
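The two steps named in the abstract (retrieve similar images by embedding, then pick the most typical caption by unigram frequency) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the similarity measure or the exact typicality score, so cosine similarity and the mean unigram count per caption are assumptions here, and all function names, the toy embeddings, and the toy captions are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_images(query, embeddings, k):
    """Return the ids of the k images most similar to the query embedding."""
    ranked = sorted(
        embeddings,
        key=lambda image_id: cosine_similarity(query, embeddings[image_id]),
        reverse=True,
    )
    return ranked[:k]


def most_typical_caption(captions):
    """Pick the caption whose words are most frequent across the whole pool.

    Assumed score: mean unigram count of the caption's words.
    """
    counts = Counter(word for c in captions for word in c.lower().split())

    def score(caption):
        words = caption.lower().split()
        return sum(counts[w] for w in words) / len(words) if words else 0.0

    return max(captions, key=score)


def caption_image(query, embeddings, captions_by_image, k=2):
    """Caption a query image by reusing a caption from its nearest neighbours."""
    neighbours = nearest_images(query, embeddings, k)
    pool = [c for image_id in neighbours for c in captions_by_image[image_id]]
    return most_typical_caption(pool)


# Toy stand-ins for CNN activations and training captions (hypothetical data).
embeddings = {
    "dog1": [1.0, 0.1, 0.0],
    "dog2": [0.9, 0.2, 0.0],
    "car1": [0.0, 0.1, 1.0],
}
captions_by_image = {
    "dog1": ["a dog runs on grass", "a brown dog playing outside"],
    "dog2": ["a dog on grass", "puppy with a ball"],
    "car1": ["a red car parked on a street"],
}
query = [1.0, 0.15, 0.0]

print(caption_image(query, embeddings, captions_by_image, k=2))
```

Because the output is always a caption written by a human for a similar image, it tends to read naturally even when it misses details of the query image, which is consistent with the reported gap between Turing-test results and correctness scores.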
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
