On Metric Learning for Audio-Text Cross-Modal Retrieval

Xinhao Mei; Xubo Liu; Jianyuan Sun; Mark D. Plumbley; Wenwu Wang

arXiv:2203.15537 · eess.AS · July 1, 2022

TL;DR

This paper evaluates various metric learning objectives for audio-text cross-modal retrieval, finding that NT-Xent loss outperforms traditional triplet losses in stability and accuracy across datasets.

Contribution

The study provides an extensive comparison of metric learning losses for audio-text retrieval, highlighting the effectiveness of NT-Xent loss over triplet-based methods.

Findings

01. NT-Xent loss achieves stable performance across datasets.

02. NT-Xent outperforms triplet-based losses.

03. Audio-text retrieval remains less explored than image-text retrieval.
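For readers unfamiliar with the triplet-based baselines named in the findings, the following is a minimal sketch of a common bidirectional hinge-based triplet loss with in-batch negatives. This page does not list the exact triplet variants the paper evaluates, so the function name `triplet_loss`, the `margin=0.2` default, and the `hardest_only` switch are illustrative assumptions rather than the authors' implementation. The input is a square similarity matrix in which entry `[i][j]` scores audio clip `i` against caption `j`, with matched pairs on the diagonal.

```python
def triplet_loss(sim, margin=0.2, hardest_only=False):
    """Bidirectional hinge-based triplet loss over in-batch negatives.

    sim[i][j] is the similarity between audio i and caption j;
    matched pairs are assumed to lie on the diagonal.
    margin and hardest_only are illustrative assumptions.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Audio i as the query: every other caption is a negative.
        a2t = [max(0.0, margin + sim[i][j] - pos) for j in range(n) if j != i]
        # Caption i as the query: every other audio clip is a negative.
        t2a = [max(0.0, margin + sim[j][i] - pos) for j in range(n) if j != i]
        if hardest_only:
            # Keep only the most violating negative in each direction.
            total += max(a2t, default=0.0) + max(t2a, default=0.0)
        else:
            # Sum the violations of all negatives.
            total += sum(a2t) + sum(t2a)
    return total / n
```

With `hardest_only=False` every negative that violates the margin contributes to the loss, while `hardest_only=True` keeps only the single worst negative per query, which tends to make training more sensitive to the choice of margin and batch composition.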

Abstract

Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates given a query in another modality. Solving such a cross-modal retrieval task is challenging because it requires not only learning robust feature representations for both modalities, but also capturing the fine-grained alignment between them. Existing cross-modal retrieval models are mostly optimized with metric learning objectives, since both aim to map data into an embedding space where similar data are close together and dissimilar data are far apart. Unlike other cross-modal retrieval tasks such as image-text and video-text retrieval, audio-text retrieval is still an unexplored task. In this work, we aim to study the impact of different metric learning objectives on the audio-text retrieval task. We present an extensive evaluation of popular metric learning…
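To make the embedding-space idea in the abstract concrete, here is a minimal sketch of a bidirectional NT-Xent (normalized temperature-scaled cross entropy) loss for paired audio and caption embeddings. It is written in plain Python for readability and is not taken from the authors' repository; the function names, the `temperature=0.07` default, and the averaging over both retrieval directions are assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax_at(scores, index):
    # Numerically stable log-softmax evaluated at a single position.
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[index] - lse

def nt_xent_loss(audio, text, temperature=0.07):
    """Bidirectional NT-Xent loss for N paired (audio, caption) embeddings.

    audio[i] and text[i] are assumed to be a matched pair; all other
    items in the batch act as negatives. temperature is an assumed default.
    """
    a = [l2_normalize(v) for v in audio]
    t = [l2_normalize(v) for v in text]
    n = len(a)
    # scaled[i][j]: cosine similarity of audio i and caption j, divided by temperature.
    scaled = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
              for ai in a]
    total = 0.0
    for i in range(n):
        row = scaled[i]                         # audio i against all captions
        col = [scaled[j][i] for j in range(n)]  # caption i against all audio clips
        total -= log_softmax_at(row, i) + log_softmax_at(col, i)
    # Average over both directions and all pairs (a normalization choice).
    return total / (2 * n)
```

A quick sanity check is that a batch whose pairs are correctly aligned yields a much lower loss than the same batch with captions swapped, for example `nt_xent_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])` versus `nt_xent_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])`.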

Code & Models

Repositories

XinhaoMei/audio-text_retrieval
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Normalized Temperature-scaled Cross Entropy Loss · Contrastive Language-Image Pre-training