Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval
Lifeng Zhou, Yuke Li, Rui Deng, Yuting Yang, Haoqi Zhu

TL;DR
This paper introduces a cross-modal denoising training paradigm that improves speech-image retrieval by enabling finer alignment between modalities, outperforming existing methods without increasing inference complexity.
Contribution
We propose a novel cross-modal denoising task, applied only during training, that enhances fine-grained alignment in speech-image retrieval without affecting inference speed.
Findings
Outperforms the state-of-the-art method by 2.0% in mean R@1 on Flickr8k
Achieves a 1.7% improvement in mean R@1 on SpokenCOCO
Effective in capturing fine-grained cross-modal details
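For readers unfamiliar with the metric, the following is a minimal sketch of how R@1 could be computed from global features with cosine similarity. It assumes that "mean R@1" is the average of R@1 over the two retrieval directions (speech-to-image and image-to-speech) and that the i-th speech clip matches the i-th image. The function names and toy vectors are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(queries, gallery, k=1):
    """Fraction of queries whose match (same index) is ranked in the top k."""
    hits = 0
    for i, q in enumerate(queries):
        sims = [cosine(q, g) for g in gallery]
        ranked = sorted(range(len(gallery)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)

def mean_recall_at_1(speech, image):
    """Assumed definition: average of R@1 over both retrieval directions."""
    speech_to_image = recall_at_k(speech, image, k=1)
    image_to_speech = recall_at_k(image, speech, k=1)
    return (speech_to_image + image_to_speech) / 2

# Toy global features (hypothetical values, two dimensions for readability).
speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]

print(mean_recall_at_1(speech, image))
```

Because this score compares only one global vector per item, it cannot tell apart items that differ in local detail, which is the limitation the abstract attributes to existing methods.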
Abstract
The success of speech-image retrieval relies on establishing an effective alignment between speech and image. Existing methods often model cross-modal interaction through simple cosine similarity of the global features of each modality, which falls short in capturing fine-grained details within modalities. To address this issue, we introduce an effective framework and a novel learning task named cross-modal denoising (CMD) that enhances cross-modal interaction to achieve finer-level cross-modal alignment. Specifically, CMD is a denoising task designed to reconstruct semantic features from noisy features within one modality by interacting with features from another modality. Notably, CMD operates exclusively during model training and can be removed at inference, so it adds no extra inference time. The experimental results demonstrate that our framework outperforms the state-of-the-art method…
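The data flow of a denoising task of this kind can be illustrated with a toy sketch: corrupt the features of one modality, let them attend to the features of the other modality, and score the reconstruction against the clean features. The abstract does not specify the noise type, the interaction module, or the loss, so the Gaussian noise, the single-head dot-product attention without learned projections, and the mean squared error used here are all assumptions made for illustration; every name in the block is hypothetical, and a real model would use learned layers.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def add_gaussian_noise(feats, sigma, rng):
    """Assumed corruption: add zero-mean Gaussian noise to every entry."""
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in feats]

def cross_attend(queries, context):
    """Single-head dot-product attention with no learned projections
    (an assumption; the paper's interaction module is not given here)."""
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        outputs.append([
            sum(weights[j] * context[j][t] for j in range(len(context)))
            for t in range(dim)
        ])
    return outputs

def mse(pred, target):
    """Mean squared error over all entries (assumed reconstruction loss)."""
    count = sum(len(row) for row in target)
    total = sum((p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return total / count

def cmd_loss(clean_feats, other_feats, sigma=0.1, seed=0):
    """Training-only auxiliary loss: denoise one modality using the other."""
    rng = random.Random(seed)
    noisy = add_gaussian_noise(clean_feats, sigma, rng)
    reconstructed = cross_attend(noisy, other_feats)
    return mse(reconstructed, clean_feats)

# Toy frame-level speech features and patch-level image features (hypothetical).
speech_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_feats = [[0.9, 0.1], [0.2, 0.8]]

print(cmd_loss(speech_feats, image_feats))
```

Since `cmd_loss` is only an extra training term in this sketch, dropping it at inference leaves the retrieval path unchanged, which is consistent with the abstract's statement that CMD adds no inference time.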
Taxonomy
Topics: Speech Recognition and Synthesis · Image Retrieval and Classification Techniques · Music and Audio Processing
