Technical Report: Image Captioning with Semantically Similar Images

Martin Kolář; Michal Hradiš; Pavel Zemčík

arXiv:1506.03995·cs.CV·June 15, 2015·5 cites

Open Access

TL;DR

This report describes an image captioning method that retrieves semantically similar images and reuses their most typical caption. It scores low on automated metrics but is competitive in the share of captions that pass the Turing test.

Contribution

The approach uses CNN activations as an embedding to find semantically similar images, then selects the most typical caption from those images based on unigram frequencies.

Findings

01

The method is competitive in the ratio of captions that pass the Turing test.

02

The method is also competitive in the ratio of captions assessed as better than or equal to human captions.

03

Scores are low on automated evaluation metrics and on human-assessed average correctness.

Abstract

This report presents our submission to the MS COCO Captioning Challenge 2015. The method uses Convolutional Neural Network activations as an embedding to find semantically similar images. From these images, the most typical caption is selected based on unigram frequencies. Although the method received low scores with automated evaluation metrics and in human-assessed average correctness, it is competitive in the ratio of captions that pass the Turing test and that are assessed as better than or equal to human captions.
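The retrieval-then-selection pipeline in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the similarity measure, the number of neighbours, or the exact scoring rule, so cosine similarity, the parameter `k`, and mean unigram frequency are all assumptions here, and every function name is hypothetical. Embeddings are plain lists of floats standing in for CNN activations.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_images(query, embeddings, k):
    """Return indices of the k stored images most similar to the query.

    Assumption: cosine similarity; the abstract does not specify the metric.
    """
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: cosine_similarity(query, embeddings[i]),
        reverse=True,
    )
    return ranked[:k]


def most_typical_caption(captions):
    """Pick the caption whose words are most frequent among all candidates.

    Assumption: "typical" is scored as the mean unigram count per token.
    """
    tokenized = [c.lower().split() for c in captions]
    counts = Counter(word for tokens in tokenized for word in tokens)

    def score(tokens):
        # Guard against empty captions to avoid division by zero.
        return sum(counts[w] for w in tokens) / max(len(tokens), 1)

    best = max(range(len(captions)), key=lambda i: score(tokenized[i]))
    return captions[best]


def caption_image(query, embeddings, captions_per_image, k=2):
    """Caption a query image by reusing a caption from its nearest neighbours."""
    neighbours = nearest_images(query, embeddings, k)
    candidates = [c for i in neighbours for c in captions_per_image[i]]
    return most_typical_caption(candidates)
```

Because the output is always an existing human-written caption, it reads naturally even when it describes the image imperfectly, which is consistent with the reported pattern of low correctness scores alongside a competitive Turing test ratio.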

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition