Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval
Benno Weck, Miguel Pérez Fernández, Holger Kirchhoff, Xavier Serra

TL;DR
This paper explores transfer-learning strategies for cross-modal text-audio retrieval using large-scale pretrained models, focusing on embedding alignment, model fine-tuning, and the impact of noisy web data.
Contribution
It introduces a transfer-learning framework combining pretrained RoBERTa and PANNs models with metric learning, highlighting the importance of loss functions and fine-tuning for improved retrieval performance.
Findings
Pretraining with noisy web data enhances model generalization.
Proper loss function selection is crucial for effective embedding alignment.
Fine-tuning pretrained models significantly improves retrieval accuracy.
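To make the point about loss selection concrete, here is a minimal sketch of two losses commonly used in metric learning for cross-modal retrieval: a margin-based triplet loss and a symmetric cross-entropy (contrastive) loss. This summary does not say which losses the paper compared, so both choices, the function names, and the `margin` and `temperature` values are assumptions for illustration only. Each function takes a similarity matrix in which entry `sim[i][j]` scores caption `i` against audio clip `j`, with matching pairs on the diagonal.

```python
import math


def triplet_margin_loss(sim, margin=0.2):
    """Mean hinge loss over every mismatched pair in the batch.

    sim[i][j] is the similarity of caption i and audio clip j.
    The margin value is an illustrative assumption.
    """
    n = len(sim)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Caption i scored against the wrong audio clip j.
            total += max(0.0, margin + sim[i][j] - sim[i][i])
            # Audio clip i scored against the wrong caption j.
            total += max(0.0, margin + sim[j][i] - sim[i][i])
            count += 2
    return total / count


def symmetric_cross_entropy_loss(sim, temperature=0.1):
    """Cross-entropy over rows (text-to-audio) and columns (audio-to-text).

    The temperature value is an illustrative assumption.
    """
    n = len(sim)

    def nll(scores, target):
        logits = [s / temperature for s in scores]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        return log_z - logits[target]

    text_to_audio = sum(nll(sim[i], i) for i in range(n)) / n
    audio_to_text = sum(nll([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (text_to_audio + audio_to_text)


# Toy similarity matrices: one well aligned, one with the pairs swapped.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
for name, sim in (("aligned", aligned), ("swapped", swapped)):
    print(name, triplet_margin_loss(sim), symmetric_cross_entropy_loss(sim))
```

The two losses behave differently on the same batch: the triplet loss drops to zero once every mismatch is beaten by the margin, while the cross-entropy loss keeps pushing matched pairs apart from mismatched ones. That difference is one plausible reason the choice of loss can matter for embedding alignment.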
Abstract
We present an analysis of large-scale pretrained deep learning models used for cross-modal (text-to-audio) retrieval. We use embeddings extracted by these models in a metric learning framework to connect matching pairs of audio and text. Shallow neural networks map the embeddings to a common dimensionality. Our system, which is an extension of our submission to the Language-based Audio Retrieval Task of the DCASE Challenge 2022, employs the RoBERTa foundation model as the text embedding extractor. A pretrained PANNs model extracts the audio embeddings. To improve the generalisation of our model, we investigate how pretraining with audio and associated noisy text collected from the online platform Freesound improves the performance of our method. Furthermore, our ablation study reveals that the proper choice of the loss function and fine-tuning the pretrained models are essential in…
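The pipeline described in the abstract can be sketched roughly as follows: embeddings from each modality pass through a shallow projection into a shared space, and audio clips are then ranked by similarity to the text query. This is a sketch under stated assumptions, not the authors' implementation. The random vectors stand in for RoBERTa and PANNs outputs, the single linear layer stands in for the "shallow neural networks", and the dimensions (6, 8, and 4), the cosine-similarity scoring, and all function names are hypothetical.

```python
import math
import random


def init_projection(in_dim, out_dim, rng):
    """Randomly initialised linear layer (weights, bias); untrained."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, weights, bias):
    """Apply y = W x + b to a single vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def l2_normalize(x):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def rank_audio(text_vec, audio_vecs):
    """Return audio indices sorted from most to least similar to the query."""
    scores = [sum(t * a for t, a in zip(text_vec, vec)) for vec in audio_vecs]
    return sorted(range(len(audio_vecs)), key=lambda i: scores[i], reverse=True)


rng = random.Random(0)
TEXT_DIM, AUDIO_DIM, SHARED_DIM = 6, 8, 4  # toy sizes, not the real model widths

text_proj = init_projection(TEXT_DIM, SHARED_DIM, rng)
audio_proj = init_projection(AUDIO_DIM, SHARED_DIM, rng)

# Placeholder embeddings; a real system would take these from the pretrained models.
text_embedding = [rng.random() for _ in range(TEXT_DIM)]
audio_embeddings = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(3)]

query = l2_normalize(linear(text_embedding, *text_proj))
candidates = [l2_normalize(linear(a, *audio_proj)) for a in audio_embeddings]
print(rank_audio(query, candidates))
```

With untrained projections the ranking is arbitrary. In a metric learning setup the projection weights (and, when fine-tuning, the pretrained encoders themselves) would be updated so that matching audio-text pairs score higher than mismatched ones.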
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · WordPiece · Adam · Dense Connections · Weight Decay · Dropout · Linear Warmup With Linear Decay · Layer Normalization
