Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

Kaiyi Luo; Xulong Zhang; Jianzong Wang; Huaxiong Li; Ning Cheng; Jing Xiao

arXiv:2309.08839·cs.SD·September 19, 2023

Open Access

TL;DR

This paper proposes CLSR, a novel audio-text cross-modal retrieval method that enhances contrastive learning with intra-modal separability, adaptive temperature control, and latent representation reconstruction, leading to improved retrieval performance.

Contribution

Introduces CLSR, a new approach that incorporates intra-modal separability, adaptive temperature adjustment, and latent reconstruction into audio-text retrieval.

Findings

01. Outperforms state-of-the-art methods on two datasets

02. Improves semantic alignment through latent reconstruction

03. Enhances intra-modal separability and adaptive contrastive learning

Abstract

Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, poses a great challenge because it is difficult to uncover discriminative features from audio clips and texts. Existing studies are limited in the following two ways: 1) Most researchers use contrastive learning to construct a common subspace in which similarities among data can be measured. However, they consider only cross-modal transformation and neglect intra-modal separability. In addition, the temperature parameter is not adaptively adjusted along with semantic guidance, which degrades performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment.…
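To make the role of the temperature parameter concrete, the following is a minimal sketch of a generic symmetric InfoNCE-style contrastive loss between paired audio and text embeddings. It is not CLSR's actual objective: the abstract does not give the loss, so the function names (`normalize`, `info_nce`, `symmetric_contrastive_loss`), the fixed temperature of 0.07, and the toy embeddings are all assumptions made for illustration. The sketch covers only the cross-modal term with a constant temperature, which is the baseline setup the abstract says it improves on.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        # Similarity of this query to every key, sharpened by the temperature.
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Average of the audio-to-text and text-to-audio InfoNCE terms.

    The temperature is a fixed constant here (an assumption for this sketch);
    the abstract argues that it should instead be adjusted adaptively.
    """
    a = [normalize(v) for v in audio_emb]
    t = [normalize(v) for v in text_emb]
    return 0.5 * (info_nce(a, t, temperature) + info_nce(t, a, temperature))


# Toy data: 4 audio/text pairs with 8-dimensional embeddings (hypothetical).
random.seed(0)
audio = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
text_aligned = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in audio]
text_random = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]

loss_aligned = symmetric_contrastive_loss(audio, text_aligned)
loss_random = symmetric_contrastive_loss(audio, text_random)
print(f"aligned pairs: {loss_aligned:.4f}, random pairs: {loss_random:.4f}")
```

A smaller temperature sharpens the softmax over candidates and penalizes hard negatives more strongly, while a larger one flattens it. That sensitivity is one reason a fixed value can be a poor fit across batches.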

Taxonomy

Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Speech and Audio Processing