Tell-the-difference: Fine-grained Visual Descriptor via a Discriminating Referee

Shuangjie Xu; Feng Xu; Yu Cheng; Pan Zhou

arXiv:1910.06426 · cs.CV · October 16, 2019 · 1 citation

TL;DR

This paper introduces a new task of describing differences between image pairs in natural language, proposing a novel encoder-decoder framework with a discriminating referee, and provides a large annotated dataset for this purpose.

Contribution

It presents a novel pairwise image difference captioning framework with innovative feature fusion techniques and introduces the first dataset for relative difference captioning.

Findings

01. The proposed model outperforms existing methods on two datasets.

02. The new dataset contains 26,710 image pairs with free language descriptions.

03. The proposed techniques improve the accuracy of describing image differences.

Abstract

In this paper, we investigate a novel problem: telling the difference between a pair of images in natural language. Compared with single-image captioning, it is challenging to derive a linguistic representation from two independent sources of visual information. To this end, we propose an effective encoder-decoder captioning framework based on Hyper Convolution Net. In addition, we introduce a series of novel feature-fusion techniques for combining pairwise visual information, and we propose a discriminating referee to evaluate the pipeline. Because no appropriate dataset exists to support this task, we collected and annotated a large new dataset with Amazon Mechanical Turk (AMT) for generating captions in a pairwise manner (14,764 images and 26,710 image pairs in total). The dataset is the first one on the relative difference caption task that provides descriptions…
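
To make the idea of fusing pairwise visual information concrete, the following is a minimal sketch of three generic fusion operators applied to two image feature vectors. The abstract does not specify which fusion techniques the paper uses, so the function name `fuse_pair`, its `mode` values, and the choice of operators (concatenation, absolute difference, element-wise product) are illustrative assumptions, not the authors' method.

```python
from typing import List, Sequence


def fuse_pair(
    feat_a: Sequence[float],
    feat_b: Sequence[float],
    mode: str = "concat",
) -> List[float]:
    """Fuse two per-image feature vectors into one pairwise vector.

    Hypothetical helper: these are generic fusion operators chosen for
    illustration, not the techniques described in the paper.
    """
    if mode == "concat":
        # Keep both vectors intact; output length is len(a) + len(b).
        return list(feat_a) + list(feat_b)

    # The element-wise modes need vectors of equal length.
    if len(feat_a) != len(feat_b):
        raise ValueError("element-wise fusion needs equal-length vectors")

    if mode == "abs_diff":
        # Emphasises dimensions where the two images disagree.
        return [abs(x - y) for x, y in zip(feat_a, feat_b)]
    if mode == "product":
        # Emphasises dimensions where both images respond strongly.
        return [x * y for x, y in zip(feat_a, feat_b)]

    raise ValueError(f"unknown fusion mode: {mode!r}")


if __name__ == "__main__":
    a = [1.0, 2.0]
    b = [3.0, 5.0]
    for mode in ("concat", "abs_diff", "product"):
        print(mode, fuse_pair(a, b, mode))
```

In an encoder-decoder captioner, a fused vector of this kind would typically be what the decoder conditions on; whether the paper fuses at this stage or elsewhere is not stated in the abstract.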

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: Convolution