Audio-free Prompt Tuning for Language-Audio Models
Yiming Li, Xiangdong Wang, Hong Liu

TL;DR
This paper introduces an audio-free prompt tuning method for CLAP models that improves zero-shot sound classification by optimizing a few prompt tokens on text rather than labeled audio, which makes adaptation more efficient and scalable while preserving transfer to unseen categories.
Contribution
It proposes a novel audio-free prompt tuning scheme leveraging modality alignment in CLAP, enabling better zero-shot performance without labeled audio data.
Findings
Boosts CLAP performance on several tasks
Outperforms other training methods in efficiency
Maintains transferability to unseen categories
Abstract
Contrastive Language-Audio Pretraining (CLAP) is pre-trained to associate audio features with human language, making it a natural zero-shot classifier for recognizing unseen sound categories. To adapt CLAP to downstream tasks, prior works inevitably require labeled domain audio, which limits their scalability under data scarcity and deprives them of the original CLAP's ability to detect novel classes. In this work, by leveraging the modality alignment in CLAP, we propose an efficient audio-free prompt tuning scheme that optimizes a few prompt tokens from text instead of audio, which also regularizes the model space to avoid overfitting to the seen classes. Based on this, a multi-grained prompt design is further explored to fuse global and local information. Experiments on several tasks demonstrate that our approach can boost CLAP and outperform other training methods on…
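The zero-shot classification the abstract refers to can be illustrated with a minimal sketch: an audio embedding is compared against the text embedding of each candidate class, and the most similar class wins. Everything below is an assumption for illustration: the three-dimensional vectors are toy values, not CLAP outputs, and the function names are hypothetical rather than part of any CLAP library. The sketch also shows why an aligned embedding space lets text stand in for audio: anything that moves the text embeddings changes the classifier without touching audio.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def zero_shot_classify(audio_embedding, class_text_embeddings):
    """Return the class whose text embedding is closest to the audio embedding."""
    scores = {
        label: cosine_similarity(audio_embedding, text_embedding)
        for label, text_embedding in class_text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy stand-ins for text-encoder outputs (hypothetical values, one per class prompt).
class_text_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "siren": [0.0, 0.8, 0.6],
    "rain": [0.1, 0.0, 1.0],
}

# Toy stand-in for an audio-encoder output (hypothetical values).
audio_embedding = [0.1, 0.7, 0.7]

predicted_label, similarity_scores = zero_shot_classify(
    audio_embedding, class_text_embeddings
)
print(predicted_label)
```

In a real setup the embeddings would come from the pre-trained audio and text encoders, and new classes could be added simply by adding entries to `class_text_embeddings`, which is what makes the classifier zero-shot.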
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
