Zero-Shot Audio Captioning Using Soft and Hard Prompts
Yiming Zhang, Xuenan Xu, Ruoyi Du, Haohe Liu, Yuan Dong, Zheng-Hua Tan, Wenwu Wang, Zhanyu Ma

TL;DR
This paper introduces a zero-shot audio captioning approach leveraging contrastive language-audio pre-training, enabling caption generation without audio-text paired training data and improving cross-domain robustness.
Contribution
The method uses only textual data for training and employs soft and hard prompts to enhance cross-domain generalization in zero-shot audio captioning.
Findings
Outperforms existing zero-shot methods on the AudioCaps and Clotho datasets.
Demonstrates strong cross-domain generalization capabilities.
Effective in both in-domain and cross-domain scenarios.
Abstract
In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, which, however, has received little attention. We propose an effective audio captioning method based on the contrastive language-audio pre-training (CLAP) model to address these issues. Our proposed method requires only textual data for training, enabling the model to generate text from the textual feature in the cross-modal semantic space. In the…
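The text-only training idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the two encoders below are hypothetical stand-ins for CLAP's text and audio encoders, and a nearest-neighbour lookup is swapped in for the language decoder purely to keep the example self-contained. The sketch assumes that, at inference, an audio embedding is substituted for the text embedding, relying on the two being aligned in one shared space.

```python
import math
import random

DIM = 64  # size of the toy shared embedding space (arbitrary choice)


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def toy_text_encoder(caption):
    """Hypothetical stand-in for a CLAP text encoder: hashed bag of words."""
    vec = [0.0] * DIM
    for word in caption.lower().split():
        # Deterministic word hash (Python's built-in hash() varies per run).
        bucket = sum(ord(c) * (i + 1) for i, c in enumerate(word)) % DIM
        vec[bucket] += 1.0
    return normalize(vec)


def toy_audio_encoder(sound_description, rng, noise=0.02):
    """Hypothetical stand-in for a CLAP audio encoder.

    Assumption: an aligned audio embedding lies close to the text embedding
    of a matching description, so it is simulated as that embedding plus
    small Gaussian noise (a crude model of the gap between modalities).
    """
    base = toy_text_encoder(sound_description)
    return normalize([x + rng.gauss(0.0, noise) for x in base])


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def train_text_only(captions):
    """'Training' sees captions only; no audio is ever used here."""
    return [(toy_text_encoder(c), c) for c in captions]


def caption_audio(audio_embedding, memory):
    """Inference: condition on an audio embedding instead of a text one."""
    best = max(memory, key=lambda item: cosine(audio_embedding, item[0]))
    return best[1]


captions = [
    "a dog barks while birds chirp",
    "rain falls on a metal roof",
    "a car engine starts and idles",
]
memory = train_text_only(captions)

rng = random.Random(0)  # fixed seed for reproducibility
audio_emb = toy_audio_encoder("rain falls on a metal roof", rng)
predicted = caption_audio(audio_emb, memory)
print(predicted)
```

The point of the sketch is only the data flow: everything learned comes from text, and audio enters solely at inference through the shared embedding space. A real system would replace the lookup with a trained text decoder and the toy encoders with a pretrained CLAP model.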
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
