L2C: Describing Visual Differences Needs Semantic Understanding of Individuals
An Yan, Xin Eric Wang, Tsu-Jui Fu, William Yang Wang

TL;DR
This paper introduces L2C, a model that enhances image comparison by incorporating semantic understanding, leading to more accurate descriptions of visual differences between image pairs.
Contribution
The paper proposes a novel Learning-to-Compare (L2C) model that integrates semantic understanding into image difference captioning, improving generalization and performance.
Findings
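L2C outperforms the baseline on both automatic and human evaluation on the Birds-to-Words dataset.
Comparing explicit semantic representations of the two images, combined with learning from single-image captions, improves the generated difference descriptions.
L2C generalizes better than the baseline to new test image pairs.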
Abstract
Recent advances in language and vision have pushed research forward from captioning a single image to describing visual differences between image pairs. Suppose there are two images, I_1 and I_2, and the task is to generate a description W_{1,2} comparing them. Existing methods directly model the { I_1, I_2 } -> W_{1,2} mapping without semantic understanding of the individual images. In this paper, we introduce a Learning-to-Compare (L2C) model, which learns to understand the semantic structures of the two images and compare them while learning to describe each one. We demonstrate that L2C benefits from a comparison between explicit semantic representations and from single-image captions, and that it generalizes better to new test image pairs. It outperforms the baseline on both automatic evaluation and human evaluation on the Birds-to-Words dataset.
