Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

Yusong Wu; Ke Chen; Tianyu Zhang; Yuchen Hui; Marianna Nezhurina; Taylor Berg-Kirkpatrick; Shlomo Dubnov

arXiv:2211.06687·cs.SD·March 25, 2024·6 cites

Open Access · 4 Repos · 10 Models

TL;DR

This paper introduces a large-scale contrastive pretraining framework for audio and language. It combines a new dataset, LAION-Audio-630K, with feature fusion and keyword-to-caption augmentation to improve audio-language representation learning.

Contribution

The paper presents the LAION-Audio-630K dataset and a contrastive pretraining model that uses feature fusion and keyword-to-caption augmentation to improve audio-language representations.

Findings

01. Superior performance in text-to-audio retrieval

02. State-of-the-art zero-shot audio classification results

03. Competitive supervised audio classification performance

Abstract

Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and…
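
To make the contrastive objective concrete, here is a minimal sketch of a symmetric, CLIP-style contrastive loss over a batch of paired audio and text embeddings. It is a generic illustration, not the authors' implementation: the function names, the fixed `temperature` value, and the toy two-dimensional embeddings are all assumptions, and the excerpt above does not state the exact loss used in the paper. The encoders themselves are out of scope; the sketch starts from embeddings that are assumed to be already computed and nonzero.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is nonzero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Average of audio-to-text and text-to-audio cross-entropy.

    audio_emb[i] and text_emb[i] are assumed to describe the same clip;
    every other pairing in the batch is treated as a negative.
    The temperature is a hypothetical fixed value chosen for illustration.
    """
    if len(audio_emb) != len(text_emb):
        raise ValueError("audio and text batches must have the same size")
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # logits[i][j] = cosine similarity of audio i and text j, scaled by 1/temperature
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    # Audio-to-text: for each audio clip, the matching caption is the target.
    loss_a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio: same computation over the columns of the logit matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_a2t + loss_t2a)


# Toy batch: two clips whose embeddings align with their own captions.
audio = [[1.0, 0.0], [0.0, 1.0]]
captions_matched = [[1.0, 0.0], [0.0, 1.0]]
captions_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(audio, captions_matched)
loss_swapped = symmetric_contrastive_loss(audio, captions_swapped)
```

With correctly paired embeddings the loss is close to zero, while swapping the captions makes it large, which is the behavior a contrastive objective relies on to pull matching audio-text pairs together and push mismatched pairs apart.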

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing