TSPE: Task-Specific Prompt Ensemble for Improved Zero-Shot Audio Classification

Nishit Anand; Ashish Seth; Ramani Duraiswami; Dinesh Manocha

arXiv:2501.00398·cs.SD·April 4, 2025

Open Access

TL;DR

TSPE is a training-free prompt ensemble method that enhances zero-shot audio classification by customizing and combining task-specific prompts, significantly improving model performance across diverse datasets.

Contribution

Introduces TSPE, a novel hard prompting technique that customizes prompts for specific audio tasks and ensembles them to boost zero-shot classification without additional training.

Findings

01

TSPE improves zero-shot performance by 1.23-16.36% across datasets.

02

Task-specific prompts outperform generic prompts in audio classification.

03

Ensembling prompts enhances audio-text alignment and classification accuracy.

Abstract

Audio-language models (ALMs) excel in zero-shot audio classification, a task where models classify previously unseen audio clips at test time by leveraging descriptive natural language prompts. We introduce TSPE (Task-Specific Prompt Ensemble), a simple, training-free hard prompting method that boosts ALMs' zero-shot performance by customizing prompts for diverse audio classification tasks. Rather than using generic template-based prompts like "Sound of a car", we generate context-rich prompts, such as "Sound of a car coming from a tunnel". Specifically, we leverage label information to identify suitable sound attributes, such as "loud" and "feeble", and appropriate sound sources, such as "tunnel" and "street", and incorporate this information into the prompts used by ALMs for audio classification. Further, to enhance audio-text alignment, we perform prompt…
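The idea described in the abstract, building context-rich prompts from a label plus sound attributes and sources and then combining them, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names (`build_prompts`, `ensemble_embedding`, `classify`) are hypothetical, the prompt templates are modeled only on the two examples quoted above, and averaging normalized text embeddings is one common way to ensemble prompts in zero-shot classifiers, not necessarily the exact procedure TSPE uses. The embeddings themselves would come from an ALM text and audio encoder, which this sketch does not include.

```python
import math


def build_prompts(label, attributes, sources):
    """Combine a class label with sound attributes and sources into prompts.

    The templates are hypothetical, modeled on the abstract's examples.
    """
    prompts = [f"Sound of a {label}"]  # generic template-based prompt
    for attr in attributes:
        prompts.append(f"{attr.capitalize()} sound of a {label}")
    for src in sources:
        prompts.append(f"Sound of a {label} coming from a {src}")
    return prompts


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def ensemble_embedding(embeddings):
    """Average L2-normalized prompt embeddings, then re-normalize.

    Assumed ensembling rule; the source excerpt does not specify one.
    """
    normed = [l2_normalize(e) for e in embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


def classify(audio_embedding, class_embeddings):
    """Return the label whose ensembled text embedding is most similar.

    class_embeddings maps label -> unit-length vector, so the dot product
    with the normalized audio embedding is the cosine similarity.
    """
    audio = l2_normalize(audio_embedding)
    scores = {
        label: sum(a * t for a, t in zip(audio, emb))
        for label, emb in class_embeddings.items()
    }
    return max(scores, key=scores.get)
```

In use, each prompt returned by `build_prompts` would be passed through the text encoder of an ALM, the resulting vectors combined per class with `ensemble_embedding`, and each audio clip's embedding matched against the per-class vectors with `classify`.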

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing