FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval
Jeong-Woo Park, Young-Eun Kim, and Seong-Whan Lee

TL;DR
FAR-Net is a multi-stage fusion framework for composed image retrieval (CIR) that combines enhanced semantic alignment with adaptive reconciliation between image and text features to improve retrieval accuracy.
Contribution
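The paper introduces FAR-Net, a multi-stage fusion framework that integrates two complementary modules, one for enhanced semantic alignment (ESAM) and one for adaptive reconciliation, to address the limitations of purely early or purely late fusion in CIR.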
Findings
Improves Recall@1 by up to 2.4% on CIRR
Enhances Recall@50 by 1.04% on FashionIQ
Demonstrates robustness and scalability in CIR tasks
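The findings above are reported in terms of Recall@K. As a minimal sketch of how that metric is commonly computed for retrieval, assuming each query has exactly one ground-truth target image, the following may help; the function name `recall_at_k` and the toy rankings are illustrative and are not taken from the paper.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Fraction of queries whose target appears among the top-k retrieved items.

    ranked_ids: one ranked list of candidate ids per query (best match first).
    target_ids: the single ground-truth id for each query (an assumption here).
    """
    if len(ranked_ids) != len(target_ids):
        raise ValueError("need exactly one target per query")
    if not target_ids:
        return 0.0
    hits = sum(
        1
        for ranking, target in zip(ranked_ids, target_ids)
        if target in ranking[:k]
    )
    return hits / len(target_ids)


# Toy data (hypothetical): four queries over a three-image gallery.
rankings = [
    ["a", "b", "c"],
    ["c", "a", "b"],
    ["b", "c", "a"],
    ["a", "c", "b"],
]
targets = ["a", "a", "a", "b"]

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(rankings, targets, k):.2f}")
```

Recall@1 rewards only an exact top-ranked match, while Recall@50 is far more forgiving, so gains at the two cutoffs are not directly comparable.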
Abstract
Composed image retrieval (CIR) is a vision-language task that retrieves a target image using a reference image and modification text, enabling intuitive specification of desired changes. While effectively fusing visual and textual modalities is crucial, existing methods typically adopt either early or late fusion. Early fusion tends to excessively focus on explicitly mentioned textual details and neglect visual context, whereas late fusion struggles to capture fine-grained semantic alignments between image regions and textual tokens. To address these issues, we propose FAR-Net, a multi-stage fusion framework designed with enhanced semantic alignment and adaptive reconciliation, integrating two complementary modules. The enhanced semantic alignment module (ESAM) employs late fusion with cross-attention to capture fine-grained semantic relationships, while the adaptive reconciliation…
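To make the idea of cross-attention between textual tokens and image regions concrete, here is a minimal sketch of generic single-head scaled dot-product cross-attention in plain Python. It is not the ESAM module: it has no learned projections, and the function names, feature vectors, and dimensions are all hypothetical choices for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attention(text_tokens, image_regions):
    """Each text token (query) attends over all image regions (keys and values).

    Returns one attended context vector per token plus the attention weights.
    Learned query/key/value projections are omitted to keep the sketch short.
    """
    dim = len(image_regions[0])
    scale = math.sqrt(dim)
    contexts, all_weights = [], []
    for query in text_tokens:
        weights = softmax([dot(query, region) / scale for region in image_regions])
        context = [
            sum(w * region[i] for w, region in zip(weights, image_regions))
            for i in range(dim)
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


# Toy features (hypothetical): two text tokens, three image regions, dimension 2.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_regions = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]

contexts, weights = cross_attention(text_tokens, image_regions)
for token_index, token_weights in enumerate(weights):
    best_region = token_weights.index(max(token_weights))
    print(f"token {token_index} attends most to region {best_region}")
```

Each row of weights sums to one, so every token's context vector is a convex combination of region features; this is the sense in which attention can align individual tokens with individual regions.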
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Medical Image Segmentation Techniques
