Natural Language Supervision for General-Purpose Audio Representations

Benjamin Elizalde; Soham Deshmukh; Huaming Wang

arXiv:2309.05767 · cs.SD · February 8, 2024

TL;DR

This paper introduces a contrastive pretraining approach for audio-language models using 4.6 million audio-text pairs, resulting in improved zero-shot and downstream task performance across diverse audio applications.

Contribution

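It proposes a contrastive language-audio pretraining framework built on two encoders: an audio encoder trained on 22 audio tasks rather than sound event classification alone, and an autoregressive decoder-only language model in place of the usual encoder-only model. The goal is general-purpose audio representation learning.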

Findings

01 Achieved state-of-the-art results on several audio tasks.

02 Demonstrated strong zero-shot generalization across 26 tasks.

03 Improved downstream performance by leveraging diverse training data.

Abstract

Audio-Language models jointly learn multimodal text and audio representations that enable Zero-Shot inference. Models rely on the encoders to create powerful representations of the input and generalize to multiple tasks ranging from sounds, music, and speech. Although models have achieved remarkable performance, there is still a performance gap with task-specific models. In this paper, we propose a Contrastive Language-Audio Pretraining model that is pretrained with a diverse collection of 4.6M audio-text pairs employing two innovative encoders for Zero-Shot inference. To learn audio representations, we trained an audio encoder on 22 audio tasks, instead of the standard training of sound event classification. To learn language representations, we trained an autoregressive decoder-only model instead of the standard encoder-only models. Then, the audio and language representations are…
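The abstract as shown here is cut off before it describes the training objective, so the following is only a minimal sketch of a generic CLIP-style symmetric contrastive loss and of zero-shot classification by similarity, not the authors' implementation. The function names, the temperature value of 0.07, and the toy embeddings in the usage note are all assumptions made for illustration; real embeddings would come from the audio and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(audio_embs, text_embs, temperature=0.07):
    """Temperature-scaled cosine similarities; rows are audio, columns are text."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t] for ai in a]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of audio-to-text and text-to-audio cross-entropy.

    Pair i in the batch (audio_embs[i], text_embs[i]) is the positive;
    every other combination in the batch acts as a negative.
    The temperature default is an assumed value, not taken from the paper.
    """
    logits = similarity_matrix(audio_embs, text_embs, temperature)
    n = len(logits)
    transposed = [list(col) for col in zip(*logits)]
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    t2a = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (a2t + t2a)


def zero_shot_classify(audio_emb, prompt_embs):
    """Return the index of the text prompt most similar to the audio clip."""
    scores = similarity_matrix([audio_emb], prompt_embs, temperature=1.0)[0]
    return max(range(len(scores)), key=scores.__getitem__)
```

With toy two-dimensional embeddings such as `audio = [[1.0, 0.0], [0.0, 1.0]]` and `text = [[2.0, 0.1], [0.1, 3.0]]`, the loss is lower when matching pairs share an index than when the text rows are swapped, and `zero_shot_classify` picks the prompt whose embedding points in the same direction as the clip. Zero-shot inference in this style needs no task-specific training: class names are written as text prompts, embedded, and compared against the audio embedding.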

Code & Models

Repositories

microsoft/clap
PyTorch · Official

Models

🤗 microsoft/msclap
model · ♡ 35

Datasets

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing