Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

Matteo Attimonelli; Alessandro De Bellis; Aryo Pradipta Gema; Rohit Saxena; Monica Sekoyan; Wai-Chung Kwan; Claudio Pomo; Alessandro Suglia; Dietmar Jannach; Tommaso Di Noia; Pasquale Minervini

arXiv:2605.14787·cs.CV·May 19, 2026

TL;DR

This paper shows that a large share of queries in widely used composed image retrieval benchmarks can be solved from a single modality, so high benchmark scores do not necessarily reflect true multimodal composition.

Contribution

The study documents how prevalent unimodal shortcuts are in Composed Image Retrieval (CIR) benchmarks and argues for more rigorous evaluation of multimodal composition.

Findings

01

Across four CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries (from 32.2% to 83.6%) can be solved using a single modality.

02

When models are re-evaluated on validated shortcut-free queries, their reliance on multimodal signals increases.

03

Current benchmarks conflate shortcut-solvable, noisy, and genuine compositional queries.

Abstract

Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model…
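To make the notion of a unimodal shortcut concrete, here is a minimal sketch of how one might flag queries whose target is retrievable from the reference image alone or from the modification text alone. It is not the paper's audit procedure: the abstract describes a cross-model criterion whose details are truncated above, whereas this sketch assumes a single embedding model, a top-k rank test, and toy 2-D vectors. The names `rank_of_target`, `unimodal_shortcuts`, and `shortcut_rate` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_of_target(query_vec, gallery, target_id):
    """1-based rank of target_id when the gallery is sorted by similarity."""
    ranked = sorted(gallery,
                    key=lambda item_id: cosine(query_vec, gallery[item_id]),
                    reverse=True)
    return ranked.index(target_id) + 1

def unimodal_shortcuts(queries, gallery, k=1):
    """For each query, list the single modalities that rank the target in the top k.

    Assumption: "solvable" means rank <= k under one modality's embedding.
    """
    report = {}
    for qid, q in queries.items():
        report[qid] = [m for m in ("image", "text")
                       if rank_of_target(q[m], gallery, q["target"]) <= k]
    return report

def shortcut_rate(report):
    """Fraction of queries solvable by at least one single modality."""
    return sum(1 for solved in report.values() if solved) / len(report)

# Toy data (made up for illustration, not taken from any benchmark).
gallery = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
queries = {
    # The reference image alone already points at the target "b".
    "q1": {"image": [0.1, 0.9], "text": [0.9, 0.1], "target": "b"},
    # Neither modality alone ranks the target "c" first.
    "q2": {"image": [1.0, 0.0], "text": [0.0, 1.0], "target": "c"},
}

report = unimodal_shortcuts(queries, gallery, k=1)
rate = shortcut_rate(report)
```

In this toy setup `q1` is flagged as an image-only shortcut while `q2` is not flagged, so it would remain a candidate for genuine composition. A real audit would replace the toy vectors with embeddings from actual models and, as the abstract indicates, aggregate the decision across several models.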
