Language-Based Audio Retrieval with Converging Tied Layers and Contrastive Loss
Andrew Koh, Eng Siong Chng

TL;DR
This paper introduces a scalable, memory-efficient architecture for language-based audio retrieval that leverages tied encoders and contrastive loss, outperforming baseline models without finetuning pretrained components.
Contribution
The paper presents a novel tied encoder architecture with contrastive loss for audio retrieval, achieving superior performance and low memory usage without finetuning pretrained models.
Findings
Significant performance improvement over baseline models.
Low training memory requirement.
Effective use of pretrained models without finetuning.
Abstract
In this paper, we tackle the new Language-Based Audio Retrieval task proposed in DCASE 2022. First, we introduce a simple, scalable architecture that ties the audio and text encoders together. Second, we show that using this architecture with a contrastive loss allows the model to significantly outperform the baseline model. Finally, in addition to having an extremely low training memory requirement, the approach lets us use pretrained models as they are, without finetuning them. We test our methods and show that combining them beats the baseline scores significantly.
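As a rough illustration of the two ideas named in the abstract, the sketch below passes audio and text embeddings through one shared ("tied") projection and scores them with a symmetric contrastive loss. It is not the authors' implementation. The summary does not specify the exact tying scheme, the loss variant, or any hyperparameters, so the function names (`tied_projection`, `symmetric_contrastive_loss`), the single shared weight matrix, the InfoNCE-style formulation, and the temperature of 0.07 are all assumptions. The input vectors stand in for outputs of frozen pretrained encoders.

```python
import math


def l2_normalize(vec):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def tied_projection(vec, weight):
    # Apply one shared weight matrix (rows = output dims) to either modality.
    # Sharing `weight` across audio and text is the assumed form of "tying".
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    # Assumed InfoNCE-style loss: pair i in audio matches pair i in text,
    # and every other item in the batch acts as a negative.
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(a)
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    # Audio-to-text direction: each row should peak on its diagonal entry.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: same check over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


# Hypothetical frozen-encoder outputs for a batch of two audio/caption pairs.
shared_weight = [[1.0, 0.0], [0.0, 1.0]]  # the only trainable part in this sketch
audio = [tied_projection(v, shared_weight) for v in [[1.0, 0.0], [0.0, 1.0]]]
text = [tied_projection(v, shared_weight) for v in [[1.0, 0.0], [0.0, 1.0]]]
matched_loss = symmetric_contrastive_loss(audio, text)
swapped_loss = symmetric_contrastive_loss(audio, list(reversed(text)))
```

In this sketch only `shared_weight` would be updated during training, which is one way to read the claim that the pretrained encoders are used without finetuning and that training memory stays low. With correctly paired embeddings the loss is near zero, and swapping the captions makes it much larger.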
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Test
