FBCIR: Balancing Cross-Modal Focuses in Composed Image Retrieval
Chenchen Zhao, Jianhuan Zhuo, Muxi Chen, Zhaohua Zhang, Wenyu Jiang, Tianwen Jiang, Qiuyong Xiao, Jihong Zhang, Qiang Xu

TL;DR
This paper introduces FBCIR, a method for interpreting cross-modal focus in composed image retrieval models, together with a data augmentation workflow that balances that focus and improves robustness, especially in challenging scenarios with hard negatives.
Contribution
The paper proposes FBCIR for focus interpretation and a data augmentation workflow to enhance cross-modal reasoning in CIR models, addressing focus imbalance issues.
Findings
Focus imbalances are common in existing CIR models.
Data augmentation with hard negatives improves model performance in difficult cases.
The approach maintains performance on standard benchmarks.
Abstract
Composed image retrieval (CIR) requires multi-modal models to jointly reason over visual content and semantic modifications presented in text-image input pairs. While current CIR models achieve strong performance on common benchmark cases, their accuracy often degrades in more challenging scenarios where negative candidates are semantically aligned with the query image or text. In this paper, we attribute this degradation to focus imbalances, where models disproportionately attend to one modality while neglecting the other. To validate this claim, we propose FBCIR, a multi-modal focus interpretation method that identifies the visual and textual input components most crucial to a model's retrieval decisions. Using FBCIR, we report that focus imbalances are prevalent in existing CIR models, especially under hard negative settings. Building on these analyses, we further propose a CIR data…
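To make the idea of a focus imbalance concrete, here is a minimal toy sketch in Python. It is not the FBCIR method, whose details are not given in this summary. It assumes a hypothetical weighted-sum fusion of image and text embeddings, dot-product scoring, and a simple ablation measure (zeroing out one modality and recording how much the target score drops). All names, vectors, and the `w_image` parameter are illustrative assumptions.

```python
import math

def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def compose(image_emb, text_emb, w_image):
    """Toy fusion (an assumption): weighted sum of the two query embeddings."""
    return [w_image * i + (1.0 - w_image) * t for i, t in zip(image_emb, text_emb)]

def modality_focus(image_emb, text_emb, target_emb, w_image):
    """Share of the target score lost when each modality is zeroed out."""
    zero = [0.0] * len(image_emb)
    full = dot(compose(image_emb, text_emb, w_image), target_emb)
    drop_image = max(full - dot(compose(zero, text_emb, w_image), target_emb), 0.0)
    drop_text = max(full - dot(compose(image_emb, zero, w_image), target_emb), 0.0)
    total = drop_image + drop_text
    return (0.5, 0.5) if total == 0 else (drop_image / total, drop_text / total)

def rank(image_emb, text_emb, candidates, w_image):
    """Candidate names sorted from best to worst match for the composed query."""
    query = compose(image_emb, text_emb, w_image)
    return sorted(candidates, key=lambda name: dot(query, candidates[name]), reverse=True)

# Hypothetical embeddings: axis 0 = reference image content, axis 1 = text modification.
reference_image = [1.0, 0.0, 0.0]
modification_text = [0.0, 1.0, 0.0]
candidates = {
    "target": unit([1.0, 1.0, 0.0]),   # matches both the image and the modification
    "hard_negative": [1.0, 0.0, 0.0],  # matches the image but ignores the modification
    "unrelated": [0.0, 0.0, 1.0],
}

balanced = rank(reference_image, modification_text, candidates, w_image=0.5)
imbalanced = rank(reference_image, modification_text, candidates, w_image=0.9)
```

In this toy setup, a balanced fusion (`w_image=0.5`) ranks the target first, while an image-dominant fusion (`w_image=0.9`) ranks the hard negative above the target. This is the kind of failure the abstract attributes to focus imbalance. `modality_focus` reports the corresponding skew as a pair of shares that sum to one.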
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
