Soft Filtering: Guiding Zero-shot Composed Image Retrieval with Prescriptive and Proscriptive Constraints
Youjin Jung, Seongwoo Cho, Hyun-seok Min, Sungchul Choi

TL;DR
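This paper introduces SoFT, a training-free, plug-and-play filtering module that uses multimodal large language models to turn a reference image and modification text into prescriptive (must-have) and proscriptive constraints for zero-shot composed image retrieval, improving retrieval accuracy. It also proposes a benchmark pipeline that accounts for ambiguous queries.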
Contribution
It proposes a novel, plug-and-play filtering approach using LLM-derived constraints and a new benchmark pipeline for more reliable evaluation.
Findings
SoFT improves retrieval metrics significantly across multiple datasets.
The approach effectively incorporates prescriptive and proscriptive constraints.
The benchmark pipeline captures ambiguity in modification texts by allowing multiple plausible targets per query.
Abstract
Composed Image Retrieval (CIR) aims to find a target image that aligns with user intent, expressed through a reference image and a modification text. While Zero-shot CIR (ZS-CIR) methods sidestep the need for labeled training data by leveraging pretrained vision-language models, they often rely on a single fused query that merges all descriptive cues of what the user wants, tending to dilute key information and failing to account for what they wish to avoid. Moreover, current CIR benchmarks assume a single correct target per query, overlooking the ambiguity in modification texts. To address these challenges, we propose Soft Filtering with Textual constraints (SoFT), a training-free, plug-and-play filtering module for ZS-CIR. SoFT leverages multimodal large language models (LLMs) to extract two complementary constraints from the reference-modification pair: prescriptive (must-have) and…
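The abstract is cut off before it states how the constraints enter the ranking, so the following is only a minimal sketch of one way a soft filter of this kind could work, not the authors' formula. It assumes each candidate image already has a base similarity score from some ZS-CIR method, that both constraints have been embedded in the same space as the images, and that the scores are adjusted additively. The function names and the weights `alpha` and `beta` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_filter(base_scores, image_embs, prescriptive_emb, proscriptive_emb,
                alpha=0.5, beta=0.5):
    """Re-score candidates: reward must-have matches, penalize must-avoid matches.

    Assumption: an additive adjustment with hypothetical weights alpha and beta.
    Candidates are never removed, only re-weighted (hence "soft").
    """
    adjusted = []
    for base, emb in zip(base_scores, image_embs):
        reward = cosine(emb, prescriptive_emb)
        penalty = cosine(emb, proscriptive_emb)
        adjusted.append(base + alpha * reward - beta * penalty)
    return adjusted


def rank(scores):
    """Return candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 2-D embeddings, purely illustrative.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
base_scores = [0.5, 0.6, 0.55]
prescriptive_emb = [1.0, 0.0]   # what the user wants
proscriptive_emb = [0.0, 1.0]   # what the user wants to avoid

adjusted = soft_filter(base_scores, image_embs, prescriptive_emb, proscriptive_emb)
before = rank(base_scores)
after = rank(adjusted)
```

In this toy case the candidate that best matches the must-have constraint moves from last to first, while the candidate aligned with the must-avoid constraint drops to last. Because the adjustment only re-weights scores, a candidate that partially violates a constraint can still be retrieved if its base score is high enough.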
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
