Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval

Lifeng Zhou; Yuke Li; Rui Deng; Yuting Yang; Haoqi Zhu

arXiv:2408.13705·cs.CL·September 12, 2024

TL;DR

This paper introduces a cross-modal denoising training paradigm that improves speech-image retrieval by enabling finer alignment between modalities, outperforming existing methods without increasing inference complexity.

Contribution

We propose a novel cross-modal denoising task, applied only during training, that enhances fine-grained alignment in speech-image retrieval without affecting inference speed.

Findings

01

Outperforms the state-of-the-art method by 2.0% in mean R@1 on Flickr8k

02

Achieves 1.7% improvement in mean R@1 on SpokenCOCO

03

Effective in capturing fine-grained cross-modal details
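
For readers unfamiliar with the metric, the following is a minimal sketch of how a mean R@1 score could be computed from paired embeddings. It assumes "mean" refers to the average of the speech-to-image and image-to-speech directions, and that row i of each matrix forms the only matching pair; both are simplifying assumptions for illustration and are not taken from the paper's evaluation protocol. The function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so a dot product equals cosine similarity.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def recall_at_k(sim, k=1):
    # sim[i, j] is the similarity between query i and candidate j.
    # Assumption: the correct candidate for query i is candidate i.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())


def mean_recall_at_1(speech_emb, image_emb):
    # Cosine similarity between global speech and image embeddings.
    sim = l2_normalize(speech_emb) @ l2_normalize(image_emb).T
    speech_to_image = recall_at_k(sim, k=1)
    image_to_speech = recall_at_k(sim.T, k=1)
    # Assumption: "mean R@1" averages the two retrieval directions.
    return 0.5 * (speech_to_image + image_to_speech)
```

Datasets with several spoken captions per image would need a ground-truth mapping instead of the one-to-one pairing assumed here.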

Abstract

The success of speech-image retrieval relies on establishing an effective alignment between speech and image. Existing methods often model cross-modal interaction through simple cosine similarity of the global features of each modality, which falls short in capturing fine-grained details within modalities. To address this issue, we introduce an effective framework and a novel learning task named cross-modal denoising (CMD) that enhances cross-modal interaction to achieve finer-level cross-modal alignment. Specifically, CMD is a denoising task designed to reconstruct semantic features from noisy features within one modality by interacting with features from another modality. Notably, CMD operates exclusively during model training and is removed at inference, so it adds no extra inference time. The experimental results demonstrate that our framework outperforms the state-of-the-art method…
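
The denoising idea described in the abstract can be illustrated with a rough sketch. Everything specific below is an assumption made for illustration, since the abstract does not give these details: additive Gaussian noise, a single-head attention step without learned projections, and a mean-squared-error reconstruction target. The names `cross_attend`, `cmd_loss`, and `noise_std` are hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    # Scaled dot-product attention with no learned weights (assumption):
    # each query becomes a weighted average of the context rows.
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context


def cmd_loss(speech_feats, image_feats, noise_std=0.5, rng=None):
    # speech_feats: (T, d) clean speech features; image_feats: (P, d) image features.
    rng = np.random.default_rng(0) if rng is None else rng
    # Corrupt one modality (assumption: additive Gaussian noise).
    noisy = speech_feats + noise_std * rng.standard_normal(speech_feats.shape)
    # Reconstruct by attending from the noisy features to the other modality.
    reconstructed = cross_attend(noisy, image_feats)
    # Assumption: mean squared error against the clean features.
    return float(np.mean((reconstructed - speech_feats) ** 2))
```

In a training loop, a term like this would be added to the retrieval objective and dropped at inference, where only the global embeddings are compared, which is consistent with the abstract's claim that CMD adds no inference cost.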

Taxonomy

Topics: Speech Recognition and Synthesis · Image Retrieval and Classification Techniques · Music and Audio Processing