Language-Based Audio Retrieval with Converging Tied Layers and Contrastive Loss

Andrew Koh; Eng Siong Chng

arXiv:2206.14659·cs.SD·June 30, 2022

Open Access

TL;DR

This paper introduces a scalable, memory-efficient architecture for language-based audio retrieval that leverages tied encoders and contrastive loss, outperforming baseline models without finetuning pretrained components.

Contribution

The paper presents a novel tied encoder architecture with contrastive loss for audio retrieval, achieving superior performance and low memory usage without finetuning pretrained models.

Findings

1. Significant performance improvement over baseline models.
2. Low training memory requirement.
3. Effective use of pretrained models without finetuning.

Abstract

In this paper, we tackle the new Language-Based Audio Retrieval task proposed in DCASE 2022. Firstly, we introduce a simple, scalable architecture that ties the audio and text encoders together. Secondly, we show that using this architecture along with contrastive loss allows the model to significantly beat the performance of the baseline model. Finally, in addition to having an extremely low training memory requirement, we are able to use pretrained models as they are, without needing to finetune them. We test our methods and show that a combination of them beats the baseline scores significantly.
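To make the idea in the abstract concrete, the following is a minimal sketch of a shared ("tied") projection applied to both audio and text features, scored with a symmetric contrastive loss. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss formulation, so a common symmetric cross-entropy over cosine similarities is assumed here; the names `tied_project` and `contrastive_loss`, the temperature value, and the requirement that both frozen encoders output features of the same dimension are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def linear(weights, v):
    """Apply a weight matrix (list of rows, out_dim x in_dim) to a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def tied_project(shared_weights, features):
    """Project a batch of features with ONE weight matrix.

    Calling this with the same `shared_weights` for audio and text features
    is what "tied" means in this sketch (an assumption, see the lead-in).
    """
    return [linear(shared_weights, f) for f in features]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    Row i of `audio_emb` is assumed to match row i of `text_emb`; every
    other row in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # Cosine similarity matrix, scaled by the temperature.
    sim = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-text direction: each audio clip should pick its own caption.
    a2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-audio direction: each caption should pick its own audio clip.
    sim_t = [list(col) for col in zip(*sim)]
    t2a = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)
```

In a setup like the one the abstract describes, the pretrained encoders would stay frozen and only the shared weights would receive gradients, which is one plausible reason the training memory requirement stays low.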

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Test