OneDiff: A Generalist Model for Image Difference Captioning
Erdong Hu, Longteng Guo, Tongtian Yue, Zijia Zhao, Shuning Xue, Jing Liu

TL;DR
OneDiff is a versatile generalist model for image difference captioning that outperforms existing methods by integrating a vision-language architecture with a new dataset and training strategy, enabling precise and adaptable difference descriptions.
Contribution
The paper introduces OneDiff, a novel generalist IDC model with a dual-phase training strategy and a new dataset, improving accuracy and robustness over specialized prior models.
Findings
Outperforms state-of-the-art IDC models by up to 97% CIDEr points
Uses a siamese encoder with Visual Delta Module for fine-grained differences
Demonstrates robustness across diverse IDC benchmarks
Abstract
In computer vision, Image Difference Captioning (IDC) is crucial for accurately describing variations between closely related images. Traditional IDC methods often rely on specialist models, which restrict their applicability across varied contexts. This paper introduces the OneDiff model, a novel generalist approach that utilizes a robust vision-language model architecture, integrating a siamese image encoder with a Visual Delta Module. This innovative configuration allows for the precise detection and articulation of fine-grained differences between image pairs. OneDiff is trained through a dual-phase strategy, encompassing Coupled Sample Training and multi-task learning across a diverse array of data types, supported by our newly developed DiffCap Dataset. This dataset merges real-world and synthetic data, enhancing the training process and bolstering the model's robustness.…
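The abstract names a siamese image encoder and a Visual Delta Module but does not say how either is built. As a rough illustration of the general idea only, the sketch below runs both images through one shared-weight patch encoder and takes the element-wise feature difference as a "delta". The function names `encode` and `visual_delta`, the linear patch projection, the subtraction, and the token concatenation are all assumptions made for this example, not the authors' design.

```python
import numpy as np

PATCH = 4  # hypothetical patch size, chosen only for this toy example


def encode(image, weights):
    """Toy stand-in for an image encoder: cut the image into
    non-overlapping PATCH x PATCH patches and project each one linearly."""
    h, w, c = image.shape
    patches = (
        image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, PATCH * PATCH * c)
    )
    return patches @ weights  # shape: (num_patches, feature_dim)


def visual_delta(feat_a, feat_b):
    """Assumed delta operation: plain element-wise feature difference.
    The real Visual Delta Module may work very differently."""
    return feat_b - feat_a


rng = np.random.default_rng(0)
weights = rng.normal(size=(PATCH * PATCH * 3, 8))  # shared by both branches (siamese)

img_a = rng.random((8, 8, 3))
img_b = img_a.copy()
img_b[0:4, 0:4, :] += 0.5  # alter only the top-left patch

feat_a = encode(img_a, weights)  # same weights for both images
feat_b = encode(img_b, weights)
delta = visual_delta(feat_a, feat_b)

# Patches whose features changed between the two images
changed = np.flatnonzero(np.abs(delta).sum(axis=1) > 1e-9)

# One assumed way to hand the result to a language model: concatenate
# the two image token sequences with the delta tokens.
tokens = np.concatenate([feat_a, feat_b, delta], axis=0)
```

Because the two branches share weights, identical patches map to identical features, so the delta is nonzero only where the images differ. That is the property that makes a siamese setup convenient for localizing fine-grained changes.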
Taxonomy
Topics: Image Retrieval and Classification Techniques
