TSPE: Task-Specific Prompt Ensemble for Improved Zero-Shot Audio Classification
Nishit Anand, Ashish Seth, Ramani Duraiswami, Dinesh Manocha

TL;DR
TSPE is a training-free prompt ensemble method that enhances zero-shot audio classification by customizing and combining task-specific prompts, significantly improving model performance across diverse datasets.
Contribution
Introduces TSPE, a novel hard prompting technique that customizes prompts for specific audio tasks and ensembles them to boost zero-shot classification without additional training.
Findings
TSPE improves zero-shot performance by 1.23-16.36% across datasets.
Task-specific prompts outperform generic prompts in audio classification.
Ensembling prompts enhances audio-text alignment and classification accuracy.
Abstract
Audio-language models (ALMs) excel in zero-shot audio classification, a task where models classify previously unseen audio clips at test time by leveraging descriptive natural language prompts. We introduce TSPE (Task-Specific Prompt Ensemble), a simple, training-free hard prompting method that boosts ALMs' zero-shot performance by customizing prompts for diverse audio classification tasks. Rather than using generic template-based prompts like "Sound of a car", we generate context-rich prompts, such as "Sound of a car coming from a tunnel". Specifically, we leverage label information to identify suitable sound attributes, such as "loud" and "feeble", and appropriate sound sources, such as "tunnel" and "street", and incorporate this information into the prompts used by ALMs for audio classification. Further, to enhance audio-text alignment, we perform prompt…
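The prompt construction and ensembling described above can be illustrated with a minimal sketch. The abstract is truncated before it specifies the ensembling step, so averaging normalized text embeddings is an assumption here, as is the exact prompt template with an attribute prefix. The function `toy_text_embedding` is a hypothetical stand-in for a real ALM text encoder, and all function names are illustrative rather than taken from the paper.

```python
import hashlib
import math
from itertools import product


def build_prompts(label, attributes, sources):
    """Combine a label with sound attributes and sources into context-rich prompts."""
    prompts = [f"Sound of a {label}"]  # generic template kept as a baseline
    for attr, src in product(attributes, sources):
        # Assumed template; the paper's exact wording may differ.
        prompts.append(f"{attr.capitalize()} sound of a {label} coming from a {src}")
    return prompts


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def toy_text_embedding(text, dim=8):
    """Hypothetical stand-in for an ALM text encoder (deterministic, not semantic)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return l2_normalize([b / 255.0 - 0.5 for b in digest[:dim]])


def ensemble_embedding(prompts, embed=toy_text_embedding):
    """Average the prompt embeddings and re-normalize (assumed ensembling rule)."""
    embs = [embed(p) for p in prompts]
    mean = [sum(e[i] for e in embs) / len(embs) for i in range(len(embs[0]))]
    return l2_normalize(mean)


def classify(audio_embedding, class_embeddings):
    """Pick the label whose ensembled text embedding is most similar to the audio."""
    scores = {
        label: sum(a * b for a, b in zip(audio_embedding, emb))
        for label, emb in class_embeddings.items()
    }
    return max(scores, key=scores.get)


attributes = ["loud", "feeble"]
sources = ["tunnel", "street"]
class_embeddings = {
    label: ensemble_embedding(build_prompts(label, attributes, sources))
    for label in ["car", "dog"]
}
```

In a real setting, `toy_text_embedding` would be replaced by the text encoder of an audio-language model, and `audio_embedding` would come from the matching audio encoder. No model is trained or fine-tuned at any point, which is what makes the approach training-free.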
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
