Do Composed Image Retrieval Benchmarks Require Multimodal Composition?
Matteo Attimonelli, Alessandro De Bellis, Aryo Pradipta Gema, Rohit Saxena, Monica Sekoyan, Wai-Chung Kwan, Claudio Pomo, Alessandro Suglia, Dietmar Jannach, Tommaso Di Noia, Pasquale Minervini

TL;DR
This paper shows that a large share of queries in composed image retrieval benchmarks can be solved from a single modality, so high benchmark performance does not necessarily reflect true multimodal composition.
Contribution
The study uncovers the prevalence of unimodal shortcuts in CIR benchmarks and emphasizes the need for more rigorous evaluation of multimodal composition.
Findings
Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, 32.2% to 83.6% of queries are solvable with a single modality.
When models are re-evaluated on validated shortcut-free queries, their reliance on multimodal signals increases.
Current benchmarks conflate shortcut-solvable, noisy, and genuine compositional queries.
Abstract
Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model…
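The idea of a query being "solvable using a single modality" can be illustrated with a small sketch. The code below is an assumption made for illustration only, not the authors' audit procedure (the abstract is cut off before the procedure is fully described). It assumes that the reference image, the modification text, and every gallery image have already been embedded into a shared vector space, and it treats a query as a unimodal shortcut when the image-only or text-only embedding alone ranks the target within the top k. The names `cosine`, `target_rank`, and `shortcut_label` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def target_rank(query_vec, gallery, target_id):
    """1-based rank of target_id when gallery ids are sorted by similarity to query_vec."""
    ranked = sorted(gallery, key=lambda gid: cosine(query_vec, gallery[gid]), reverse=True)
    return ranked.index(target_id) + 1


def shortcut_label(image_vec, text_vec, gallery, target_id, k=1):
    """Label a query by which single modality, if any, retrieves the target in the top k.

    Assumption: top-k retrieval with one modality counts as a unimodal shortcut.
    """
    image_hit = target_rank(image_vec, gallery, target_id) <= k
    text_hit = target_rank(text_vec, gallery, target_id) <= k
    if image_hit and text_hit:
        return "both"
    if image_hit:
        return "image-only"
    if text_hit:
        return "text-only"
    return "none"  # neither modality alone is enough at this k


# Toy gallery with hypothetical 2-d embeddings.
gallery = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
label = shortcut_label([0.9, 0.1], [0.0, 1.0], gallery, target_id="a")
```

In this toy setup the image-only query ranks target "a" first while the text-only query ranks it last, so the query is labelled an image-only shortcut. Target "c" is ranked second by both single-modality queries, so at k = 1 it is labelled "none", which is the kind of query that could require combining both modalities.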
