Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov

TL;DR
This paper introduces a large-scale contrastive pretraining framework for audio and language, utilizing a new dataset and innovative feature fusion and augmentation techniques to improve multimodal audio understanding.
Contribution
The paper presents the LAION-Audio-630K dataset and a contrastive pretraining model that uses feature fusion and keyword-to-caption augmentation to improve audio-language representations.
Findings
Superior performance in text-to-audio retrieval
State-of-the-art zero-shot audio classification results
Competitive supervised audio classification performance
Abstract
Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and…
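The contrastive objective mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the common CLIP-style symmetric cross-entropy over a batch of paired audio and text embeddings, and the function names, the fixed `temperature=0.07`, and the random embeddings standing in for encoder outputs are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """CLIP-style loss: row i of audio_emb is paired with row i of text_emb.

    The fixed temperature is a simplifying assumption; in practice it is
    often a learnable parameter.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])        # matching pairs lie on the diagonal
    loss_audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)

# Hypothetical stand-ins for encoder outputs: 8 pairs, 16-dimensional embeddings.
rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))
text_aligned = audio + 0.1 * rng.normal(size=(8, 16))   # well-matched pairs
text_shuffled = np.roll(text_aligned, shift=1, axis=0)  # deliberately mismatched

loss_aligned = symmetric_contrastive_loss(audio, text_aligned)
loss_shuffled = symmetric_contrastive_loss(audio, text_shuffled)
```

In this sketch, correctly paired embeddings yield a lower loss than the same embeddings with the pairing shuffled, which is the signal that pulls matching audio and text together in the shared embedding space.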
Code & Models
- 🤗 laion/clap-htsat-fused (model) · 25.4M downloads · 68 likes
- 🤗 laion/clap-htsat-unfused (model) · 377k downloads · 69 likes
- 🤗 ybelkada/clap-model-card (model) · 1 like
- 🤗 laion/larger_clap_music (model) · 74k downloads · 43 likes
- 🤗 laion/larger_clap_music_and_speech (model) · 35k downloads · 37 likes
- 🤗 laion/larger_clap_general (model) · 606k downloads · 48 likes
- 🤗 ArthurSynthia/clap1 (model) · 2 downloads
- 🤗 weslie520/clap-htsat-fused (model) · 5 downloads
- 🤗 derektan95/search-tta-sound (model) · 5 downloads
- 🤗 samueldashadrach/clap-htsat-unfused-endpoint (model) · 1 download
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
