TL;DR
This paper introduces a new dataset and model for automatically describing differences between similar images, advancing the alignment of language and vision in visual comparison tasks.
Contribution
The paper presents a novel dataset of difference descriptions for image pairs and a model that improves over attention-based methods by explicitly capturing visual salience.
Findings
Proposed model outperforms attention-only models in single-sentence generation.
Dataset enables exploration of language-vision alignment and multi-sentence coherence.
First-pass visual analysis exposes clusters of differing pixels as a proxy for object-level differences.
Abstract
In this paper, we introduce the task of automatically generating text to describe the differences between two similar images. We collect a new dataset by crowd-sourcing difference descriptions for pairs of image frames extracted from video-surveillance footage. Annotators were asked to succinctly describe all the differences in a short paragraph. As a result, our novel dataset provides an opportunity to explore models that align language and vision, and capture visual salience. The dataset may also be a useful benchmark for coherent multi-sentence generation. We perform a first-pass visual analysis that exposes clusters of differing pixels as a proxy for object-level differences. We propose a model that captures visual salience by using a latent variable to align clusters of differing pixels with output sentences. We find that, for both single-sentence generation and as well as…
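The first-pass visual analysis mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how differing pixels are detected or grouped, so everything below is an assumption made for illustration: grayscale frames stored as nested lists, a fixed intensity threshold of 30, 4-connected grouping by breadth-first search, and the hypothetical function names `diff_mask` and `cluster_differences`. It is not the authors' procedure.

```python
from collections import deque


def diff_mask(img_a, img_b, threshold=30):
    """Mark pixels whose absolute grayscale difference exceeds the threshold.

    The threshold value is an illustrative assumption, not taken from the paper.
    """
    return [
        [abs(a - b) > threshold for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


def cluster_differences(mask):
    """Group differing pixels into 4-connected clusters with breadth-first search."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    clusters = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Start a new cluster at the first unvisited differing pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            cluster = []
            while queue:
                y, x = queue.popleft()
                cluster.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            clusters.append(cluster)
    return clusters


# Toy 4x5 grayscale frames standing in for two surveillance stills.
before = [[10] * 5 for _ in range(4)]
after = [row[:] for row in before]
after[0][0] = 200  # first changed region (two adjacent pixels)
after[0][1] = 200
after[3][4] = 120  # second, separate changed region

clusters = cluster_differences(diff_mask(before, after))
```

Each resulting cluster is a list of `(row, column)` coordinates. In a setup like the one the abstract describes, such clusters would be the units that a latent alignment variable pairs with output sentences; a density-based clustering method could be swapped in for the connected-component grouping used here.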
