Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval
Kaiyi Luo, Xulong Zhang, Jianzong Wang, Huaxiong Li, Ning Cheng, Jing Xiao

TL;DR
This paper proposes CLSR, a novel audio-text cross-modal retrieval method that enhances contrastive learning with intra-modal separability, adaptive temperature control, and latent representation reconstruction, leading to improved retrieval performance.
Contribution
Introduces CLSR, a new approach that incorporates intra-modal separability, adaptive temperature adjustment, and latent reconstruction into audio-text retrieval.
Findings
Outperforms state-of-the-art methods on two datasets
Improves semantic alignment through latent reconstruction
Enhances intra-modal separability and adaptive contrastive learning
Abstract
Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, poses a great challenge because it is difficult to uncover discriminative features from audio clips and texts. Existing studies are restricted in the following two ways: 1) Most researchers use contrastive learning to construct a common subspace where similarities among data can be measured. However, they consider only cross-modal transformation and neglect intra-modal separability. In addition, the temperature parameter is not adaptively adjusted along with semantic guidance, which degrades performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment.…
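For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over a batch of paired audio and text embeddings, with a fixed temperature. It is not the CLSR objective: the function names, the default temperature of 0.07, and the use of cosine similarity are assumptions made for illustration, and the intra-modal, adaptive-temperature, and reconstruction terms described in the paper are not included.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Generic cross-modal contrastive loss (illustrative, not CLSR).

    audio_emb, text_emb: arrays of shape (batch, dim); row i of each
    is assumed to be a matching audio-text pair.
    temperature: fixed scalar (assumed value); smaller values sharpen
    the similarity distribution.
    """
    # L2-normalise so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarities in the common subspace, scaled by temperature.
    logits = a @ t.T / temperature

    # Matching pairs sit on the diagonal.
    idx = np.arange(len(a))
    audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (audio_to_text + text_to_audio)
```

Because the temperature here is a single constant applied to every pair, the sketch also shows the limitation the abstract points to: nothing in this loss adapts the temperature to the semantics of individual samples, and only cross-modal pairs contribute to the objective.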
Taxonomy
Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Speech and Audio Processing
