Tell-the-difference: Fine-grained Visual Descriptor via a Discriminating Referee
Shuangjie Xu, Feng Xu, Yu Cheng, Pan Zhou

TL;DR
This paper introduces a new task, describing the differences between a pair of images in natural language. It proposes an encoder-decoder framework with a discriminating referee and provides a large annotated dataset for the task.
Contribution
The paper presents a pairwise image-difference captioning framework with new feature fusion techniques, and introduces the first dataset for relative difference captioning.
Findings
The proposed model is reported to outperform existing methods on two datasets.
The new dataset contains 14,764 images and 26,710 image pairs with free-language descriptions.
The proposed techniques improve the accuracy of describing image differences.
Abstract
In this paper, we investigate a novel problem of telling the difference between image pairs in natural language. Compared to previous approaches for single image captioning, it is challenging to derive a linguistic representation from two independent sources of visual information. To this end, we propose an effective encoder-decoder captioning framework based on Hyper Convolution Net. In addition, we introduce a series of novel feature fusion techniques for pairwise visual information, and propose a discriminating referee to evaluate the pipeline. Because of the lack of appropriate datasets to support this task, we have collected and annotated a large new dataset with Amazon Mechanical Turk (AMT) for generating captions in a pairwise manner (with 14,764 images and 26,710 image pairs in total). The dataset is the first one on the relative difference caption task that provides descriptions…
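The abstract mentions feature fusion for pairwise visual information but does not say which operators are used. As a rough illustration only, the sketch below shows three generic ways two image feature vectors are commonly combined before decoding: concatenation, element-wise difference, and element-wise product. The function names, the 4-dimensional features, and the choice of operators are all assumptions made for this example, not the paper's method.

```python
from typing import List

Vector = List[float]


def _check_same_length(a: Vector, b: Vector) -> None:
    """Element-wise fusion only makes sense for equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")


def fuse_concat(a: Vector, b: Vector) -> Vector:
    """Concatenate both feature vectors, keeping each one intact."""
    return a + b


def fuse_difference(a: Vector, b: Vector) -> Vector:
    """Element-wise difference: non-zero where the two features diverge."""
    _check_same_length(a, b)
    return [x - y for x, y in zip(a, b)]


def fuse_product(a: Vector, b: Vector) -> Vector:
    """Element-wise product: large where both features are active."""
    _check_same_length(a, b)
    return [x * y for x, y in zip(a, b)]


# Hypothetical features for the two images of a pair (illustrative values).
feat_a = [0.5, 1.0, 0.0, 2.0]
feat_b = [0.5, 0.0, 1.0, 2.0]

fused_concat = fuse_concat(feat_a, feat_b)    # length 8
fused_diff = fuse_difference(feat_a, feat_b)  # length 4
fused_prod = fuse_product(feat_a, feat_b)     # length 4
```

In a captioning pipeline, a fused vector of this kind would typically be passed to the decoder in place of a single-image feature. Which fusion works best for difference captioning is an empirical question that the abstract alone does not settle.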
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
Methods: Convolution
