Eliminating Hallucination in Diffusion-Augmented Interactive Text-to-Image Retrieval
Zhuocheng Zhang, Kangheng Liang, Guanxuan Li, Paul Henderson, Richard McCreadie, Zijun Long

TL;DR
This paper introduces DMCL, a training framework that enhances diffusion-augmented text-to-image retrieval by reducing hallucinations, leading to more accurate retrieval results across multiple benchmarks.
Contribution
The paper proposes DMCL, a novel contrastive learning method that filters hallucinated cues in diffusion-augmented retrieval, improving robustness and performance.
Findings
DMCL improves multi-round Hits@10 by up to 7.37%.
DMCL effectively filters hallucinated cues in diffusion-generated views.
The framework shows consistent gains across five standard benchmarks.
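For readers unfamiliar with the metric behind the first finding, Hits@K is the fraction of queries whose target image appears among the top K retrieved results. The sketch below is a minimal illustration of that standard definition. The function names and the toy data are hypothetical, and the paper's exact multi-round protocol (for example, whether a hit is accumulated across dialogue rounds) is not stated in this summary, so only the single-round form is shown.

```python
def hits_at_k(ranked_ids, target_id, k=10):
    """Return 1.0 if target_id is among the first k ranked ids, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def mean_hits_at_k(rankings, targets, k=10):
    """Average Hits@K over queries.

    rankings[i] is the ranked list of image ids retrieved for query i,
    and targets[i] is the ground-truth image id for that query.
    """
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not rankings:
        return 0.0
    total = sum(hits_at_k(r, t, k) for r, t in zip(rankings, targets))
    return total / len(rankings)


# Toy example (hypothetical ids): the first query hits within the top 2,
# the second query's target is never retrieved.
example_rankings = [["a", "b", "c"], ["d", "e", "f"]]
example_targets = ["b", "z"]
example_score = mean_hits_at_k(example_rankings, example_targets, k=2)
```

Reporting the metric per dialogue round would amount to calling `mean_hits_at_k` once per round on that round's rankings.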
Abstract
Diffusion-Augmented Interactive Text-to-Image Retrieval (DAI-TIR) is a promising paradigm that improves retrieval performance by generating query images via diffusion models and using them as additional "views" of the user's intent. However, these generative views can be incorrect because diffusion generation may introduce hallucinated visual cues that conflict with the original query text. Indeed, we empirically demonstrate that these hallucinated cues can substantially degrade DAI-TIR performance. To address this, we propose Diffusion-aware Multi-view Contrastive Learning (DMCL), a hallucination-robust training framework that casts DAI-TIR as joint optimization over representations of query intent and the target image. DMCL introduces semantic-consistency and diffusion-aware contrastive objectives to align textual and diffusion-generated query views while suppressing hallucinated…
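The abstract is truncated here and does not give the DMCL objectives, so the following is not the paper's loss. It is a generic sketch of the broader idea of multi-view contrastive alignment, using the standard InfoNCE loss: each query view (text, or a diffusion-generated image) is pulled toward its own target image embedding and pushed away from the other targets in the batch. The function names, the `gen_weight` down-weighting of the generated view, and the temperature value are all assumptions made for illustration.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors (eps avoids 0/0)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def info_nce(queries, targets, temperature=0.07):
    """Mean InfoNCE loss where queries[i] matches targets[i].

    All other targets in the batch act as negatives for query i.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def multi_view_loss(text_queries, generated_queries, targets,
                    gen_weight=0.5, temperature=0.07):
    """Hypothetical combination: text view plus a down-weighted generated view.

    gen_weight is an illustrative knob, not a parameter from the paper.
    """
    text_term = info_nce(text_queries, targets, temperature)
    generated_term = info_nce(generated_queries, targets, temperature)
    return text_term + gen_weight * generated_term
```

In this toy form, a generated view that points at the wrong target raises the loss in proportion to `gen_weight`, which hints at why an unfiltered hallucinated view can hurt retrieval. How DMCL itself detects and suppresses such views is not recoverable from the truncated text.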
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
