Audio-free Prompt Tuning for Language-Audio Models

Yiming Li; Xiangdong Wang; Hong Liu

arXiv:2309.08357·eess.AS·September 18, 2023

Open Access

TL;DR

This paper introduces an audio-free prompt tuning method for CLAP models: it optimizes a few prompt tokens using text alone rather than labeled audio, improving downstream sound classification while preserving zero-shot transfer to unseen categories.

Contribution

It proposes a novel audio-free prompt tuning scheme leveraging modality alignment in CLAP, enabling better zero-shot performance without labeled audio data.

Findings

01. Boosts CLAP performance on several tasks

02. Outperforms other training methods in efficiency

03. Maintains transferability to unseen categories

Abstract

Contrastive Language-Audio Pretraining (CLAP) is pre-trained to associate audio features with human language, making it a natural zero-shot classifier for recognizing unseen sound categories. To adapt CLAP to downstream tasks, prior works inevitably require labeled in-domain audio, which limits their scalability under data scarcity and deprives them of the original CLAP's ability to detect novel classes. In this work, by leveraging the modality alignment in CLAP, we propose an efficient audio-free prompt tuning scheme that optimizes a few prompt tokens from text instead of audio, which also regularizes the model space to avoid overfitting the seen classes. Based on this, a multi-grained prompt design is further explored to fuse global and local information. Experiments on several tasks demonstrate that our approach can boost the CLAP and outperform other training methods on…
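The zero-shot classification step that the abstract builds on can be illustrated with a minimal sketch. It assumes the audio and text encoders have already produced embeddings in a shared space. The three-dimensional vectors, the class names, and the function names `cosine` and `zero_shot_classify` are hypothetical placeholders, not CLAP outputs or the authors' code. The predicted class is the one whose text embedding has the highest cosine similarity to the audio embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(audio_embedding, class_text_embeddings):
    """Return the best-matching label and the per-class similarity scores.

    class_text_embeddings maps each class label to the embedding of its
    text prompt; no class-specific training is involved.
    """
    scores = {
        label: cosine(audio_embedding, text_embedding)
        for label, text_embedding in class_text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy stand-ins for encoder outputs (assumed values, not real CLAP embeddings).
text_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "siren": [0.1, 0.9, 0.1],
    "rain": [0.0, 0.2, 0.9],
}
audio_embedding = [0.2, 0.8, 0.1]

label, scores = zero_shot_classify(audio_embedding, text_embeddings)
print(label)
```

Because both modalities are assumed to share one embedding space, adding a new class only requires embedding a new text prompt. That same alignment is what the abstract leverages to tune prompt tokens from text rather than audio; the tuning procedure itself is not shown here.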

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis