Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval

Young Kyun Jang; Donghyun Kim; Zihang Meng; Dat Huynh; Ser-Nam Lim

arXiv:2404.15516 · cs.CV · April 25, 2024

TL;DR

This paper introduces a semi-supervised method for Composed Image Retrieval (CIR). Its Visual Delta Generator (VDG), built on a large multimodal model, generates text describing the visual difference between two images; these descriptions supply additional training signal, and the resulting approach outperforms existing methods.

Contribution

The paper presents a novel semi-supervised CIR approach utilizing a large language model-based Visual Delta Generator to generate pseudo triplets, improving retrieval performance and scalability.

Findings

01 Achieves state-of-the-art results on CIR benchmarks.

02 Significantly outperforms existing supervised approaches.

03 Demonstrates effectiveness of pseudo triplet generation.

Abstract

Composed Image Retrieval (CIR) is a task that retrieves images that are similar to a query image and reflect a provided textual modification. Current techniques train CIR models with supervised learning on labeled triplets of (reference image, text, target image). Such triplets are far less common than simple image-text pairs, which limits the scalability and widespread use of CIR. Zero-shot CIR, on the other hand, can be trained relatively easily on image-caption pairs without considering the image-to-image relation, but it tends to yield lower accuracy. We propose a new semi-supervised CIR approach in which we search auxiliary data for a reference image and its related target images, and train our large language model-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e., visual delta) between the two. VDG, equipped with fluent language…
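
The pseudo triplet idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how related images are found, so the cosine-similarity window (`low`, `high`) is an assumption made here for illustration, and `describe_delta` is a hypothetical stand-in for the VDG, which in the paper is a large language model-based generator. Embeddings are plain lists of floats so the example needs only the standard library.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_pairs(
    embeddings: List[Sequence[float]],
    low: float = 0.5,    # assumed lower bound: the pair should be related
    high: float = 0.95,  # assumed upper bound: skip near-duplicates
) -> List[Tuple[int, int]]:
    """For each image, pick the most similar other image inside [low, high]."""
    pairs = []
    for i, ref in enumerate(embeddings):
        best_j, best_sim = None, low
        for j, cand in enumerate(embeddings):
            if i == j:
                continue
            sim = cosine(ref, cand)
            if best_sim <= sim <= high:
                best_j, best_sim = j, sim
        if best_j is not None:
            pairs.append((i, best_j))
    return pairs


def build_pseudo_triplets(
    embeddings: List[Sequence[float]],
    describe_delta: Callable[[int, int], str],  # hypothetical VDG stand-in
    low: float = 0.5,
    high: float = 0.95,
) -> List[Tuple[int, str, int]]:
    """Return (reference index, delta text, target index) pseudo triplets."""
    return [
        (ref, describe_delta(ref, tgt), tgt)
        for ref, tgt in mine_pairs(embeddings, low, high)
    ]
```

In this sketch, the resulting triplets would be added to the labeled ones when training a CIR model. In practice the embeddings would come from an image encoder and `describe_delta` would call a trained generator; both are outside the scope of this example.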

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques