OneDiff: A Generalist Model for Image Difference Captioning

Erdong Hu; Longteng Guo; Tongtian Yue; Zijia Zhao; Shuning Xue; Jing Liu

arXiv:2407.05645 · cs.CV · May 27, 2025

TL;DR

OneDiff is a generalist model for image difference captioning (IDC). It combines a vision-language architecture with a new dataset and a two-phase training strategy, and it outperforms existing specialist methods while adapting to varied kinds of image pairs.

Contribution

The paper introduces OneDiff, a generalist IDC model, together with a dual-phase training strategy and a new dataset (DiffCap). The combination improves accuracy and robustness over prior specialist models.

Findings

01 Outperforms state-of-the-art IDC models by up to 97% CIDEr points

02 Uses a siamese encoder with Visual Delta Module for fine-grained differences

03 Demonstrates robustness across diverse IDC benchmarks

Abstract

In computer vision, Image Difference Captioning (IDC) is crucial for accurately describing variations between closely related images. Traditional IDC methods often rely on specialist models, which restrict their applicability across varied contexts. This paper introduces the OneDiff model, a novel generalist approach that utilizes a robust vision-language model architecture, integrating a siamese image encoder with a Visual Delta Module. This innovative configuration allows for the precise detection and articulation of fine-grained differences between image pairs. OneDiff is trained through a dual-phase strategy, encompassing Coupled Sample Training and multi-task learning across a diverse array of data types, supported by our newly developed DiffCap Dataset. This dataset merges real-world and synthetic data, enhancing the training process and bolstering the model's robustness.…
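
The siamese idea mentioned in the abstract can be illustrated with a toy sketch: a single encoder, with one shared set of weights, is applied to both images, and the two feature vectors are then compared. Everything below is an assumption made for illustration. The linear `encode` function, the `siamese_delta` helper, and the plain feature subtraction are hypothetical stand-ins; the abstract does not specify how the Visual Delta Module works, and this is not the authors' implementation.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def encode(image: Vector, weights: Matrix) -> Vector:
    """Toy linear encoder: project a flattened image with a weight matrix."""
    return [sum(w * x for w, x in zip(row, image)) for row in weights]


def siamese_delta(image_a: Vector, image_b: Vector, weights: Matrix) -> Vector:
    """Encode both images with the SAME weights, then subtract the features.

    The subtraction is a hypothetical stand-in for a difference step,
    not the paper's Visual Delta Module.
    """
    feats_a = encode(image_a, weights)
    feats_b = encode(image_b, weights)
    return [b - a for a, b in zip(feats_a, feats_b)]


# Hypothetical 2x3 weights and two 3-pixel "images" that differ in one pixel.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
before = [0.5, 0.2, 0.1]
after = [0.5, 0.2, 0.9]
delta = siamese_delta(before, after, weights)
```

Because both images pass through the same weights, identical inputs always yield an all-zero difference, so any nonzero entry in `delta` reflects a change between the two images rather than a mismatch between encoders. In a real vision-language model the difference features would be passed to a language model to generate the caption; that stage is omitted here.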

Taxonomy

Topics: Image Retrieval and Classification Techniques