ADaFuSE: Adaptive Diffusion-generated Image and Text Fusion for Interactive Text-to-Image Retrieval

Zhuocheng Zhang; Xingwu Zhang; Kangheng Liang; Guanxuan Li; Richard Mccreadie; Zijun Long

arXiv:2603.21886·cs.IR·March 24, 2026

Open Access

TL;DR

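This paper introduces ADaFuSE, a lightweight adaptive fusion model for diffusion-augmented interactive text-to-image retrieval (I-TIR). Instead of fusing views by static embedding addition, it dynamically balances and calibrates the textual and diffusion-generated views, improving effectiveness and robustness without modifying the backbone encoder of existing frameworks.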

Contribution

The paper proposes ADaFuSE, a lightweight adaptive fusion mechanism with semantic-aware experts. It enhances diffusion-augmented I-TIR by aligning the multi-modal views of user feedback and reducing the impact of generative noise from the diffusion model.

Findings

01 Achieves state-of-the-art performance on four I-TIR benchmarks.

02 Surpasses DAR by up to 3.49% in Hits@10 with minimal parameter increase.

03 Demonstrates robustness to noisy and longer interactive queries.
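
For readers unfamiliar with the Hits@10 metric cited in the findings, the following is a minimal sketch of a generic Hits@K computation. It assumes one score list per query and one known target index per query; the function name `hits_at_k` and the input layout are illustrative assumptions, not taken from the paper. I-TIR evaluations may also accumulate hits across dialogue rounds, which this plain per-query version does not model.

```python
from typing import Sequence


def hits_at_k(
    scores: Sequence[Sequence[float]],
    targets: Sequence[int],
    k: int = 10,
) -> float:
    """Fraction of queries whose target image ranks within the top k.

    scores[q][c] is the similarity between query q and candidate image c
    (hypothetical layout); targets[q] is the index of the correct image.
    """
    if not targets:
        raise ValueError("at least one query is required")
    hits = 0
    for query_scores, target in zip(scores, targets):
        target_score = query_scores[target]
        # Zero-based rank: number of candidates scoring strictly higher.
        rank = sum(1 for s in query_scores if s > target_score)
        if rank < k:
            hits += 1
    return hits / len(targets)


# Two toy queries over three candidate images.
toy_scores = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]]
toy_targets = [0, 2]
top1 = hits_at_k(toy_scores, toy_targets, k=1)
top2 = hits_at_k(toy_scores, toy_targets, k=2)
```

In the toy data, the first query ranks its target first, while the second query ranks its target second, so only the first counts as a hit at K=1 and both count at K=2.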

Abstract

Recent advances in interactive text-to-image retrieval (I-TIR) use diffusion models to bridge the modality gap between the textual information need and the images to be searched, resulting in increased effectiveness. However, existing frameworks fuse multi-modal views of user feedback by simple embedding addition. In this work, we show that this static and undifferentiated fusion indiscriminately incorporates generative noise produced by the diffusion model, leading to performance degradation for up to 55.62% of samples. We further propose ADaFuSE (Adaptive Diffusion-Text Fusion with Semantic-aware Experts), a lightweight fusion model designed to align and calibrate multi-modal views for diffusion-augmented I-TIR, which can be plugged into existing frameworks without modifying the backbone encoder. Specifically, we introduce a dual-branch fusion mechanism that employs an adaptive gating…
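
To illustrate the contrast the abstract draws between static embedding addition and adaptive gating, here is a minimal sketch. The abstract is truncated before it describes the gating design, so this is not the authors' dual-branch mechanism or their semantic-aware experts: it is a generic scalar sigmoid gate over the concatenated embeddings, and the names `additive_fusion`, `gated_fusion`, `gate_weights`, and `gate_bias` are all hypothetical.

```python
import math
from typing import List, Tuple


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0.0 else [x / norm for x in vec]


def additive_fusion(text_emb: List[float], image_emb: List[float]) -> List[float]:
    """Static baseline: add the two views with equal, fixed weight."""
    return l2_normalize([t + i for t, i in zip(text_emb, image_emb)])


def gated_fusion(
    text_emb: List[float],
    image_emb: List[float],
    gate_weights: List[float],
    gate_bias: float,
) -> Tuple[List[float], float]:
    """Generic adaptive fusion (an assumption, not the paper's design).

    A scalar gate g = sigmoid(w . [text; image] + b) decides, per query,
    how much to trust the text view versus the generated-image view.
    """
    concat = text_emb + image_emb
    logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-logit))
    fused = [gate * t + (1.0 - gate) * i for t, i in zip(text_emb, image_emb)]
    return l2_normalize(fused), gate


text = [1.0, 0.0]
image = [0.0, 1.0]
baseline = additive_fusion(text, image)
# With zero weights and zero bias the gate is 0.5, matching the baseline.
balanced, g_balanced = gated_fusion(text, image, [0.0] * 4, 0.0)
# A strongly positive bias pushes the gate toward the text view.
text_heavy, g_text = gated_fusion(text, image, [0.0] * 4, 20.0)
```

The point of the sketch is only that a learned gate can down-weight a noisy generated image for a given query, whereas plain addition always weights both views equally. In practice the gate parameters would be trained, and they are fixed by hand here purely for illustration.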

Taxonomy

Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques