ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

TL;DR
This paper introduces ReCLAP, a model that enhances zero-shot audio classification by using descriptive prompts and rewritten captions, significantly improving performance over existing methods.
Contribution
ReCLAP is trained with rewritten captions describing sounds' unique features and employs prompt augmentation, leading to substantial improvements in zero-shot audio classification accuracy.
Findings
ReCLAP outperforms all baselines on multi-modal audio-text retrieval.
Prompt augmentation improves ReCLAP's zero-shot audio classification performance by 1%-18%.
The full method outperforms all baselines on zero-shot audio classification by 1%-55%.
Abstract
Open-vocabulary audio-language models, like CLAP, offer a promising approach for zero-shot audio classification (ZSAC) by enabling classification with any arbitrary set of categories specified with natural language prompts. In this paper, we propose a simple but effective method to improve ZSAC with CLAP. Specifically, we shift from the conventional method of using prompts with abstract category labels (e.g., Sound of an organ) to prompts that describe sounds using their inherent descriptive features in a diverse context (e.g., The organ's deep and resonant tones filled the cathedral.). To achieve this, we first propose ReCLAP, a CLAP model trained with rewritten audio captions for improved understanding of sounds in the wild. These rewritten captions describe each sound event in the original caption using their unique discriminative characteristics. ReCLAP outperforms all baselines on…
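The zero-shot scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the short vectors are toy stand-ins for the embeddings a CLAP-style audio encoder and text encoder would produce, the function names are hypothetical, and averaging the embeddings of several descriptive prompts per category is an assumption made here for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_embedding(vectors):
    """Average several unit-normalized prompt embeddings into one class embedding.

    Assumption: mean pooling over prompts; the source does not state the aggregation.
    """
    normalized = [l2_normalize(v) for v in vectors]
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def classify(audio_embedding, prompt_embeddings_by_class):
    """Return the best-matching label and the per-class similarity scores."""
    scores = {
        label: cosine(audio_embedding, mean_embedding(vectors))
        for label, vectors in prompt_embeddings_by_class.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (hypothetical): two descriptive prompts per category, already embedded.
# With a real model, each vector would come from encoding a sentence such as
# "The organ's deep and resonant tones filled the cathedral."
PROMPT_EMBEDDINGS = {
    "organ": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "dog bark": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.3]],
}
AUDIO_EMBEDDING = [0.85, 0.15, 0.05]  # stand-in for an encoded audio clip

label, scores = classify(AUDIO_EMBEDDING, PROMPT_EMBEDDINGS)
print(label, scores)
```

Because the categories are defined only by text prompts, changing the label set means changing the keys and prompts in `PROMPT_EMBEDDINGS`; no retraining is involved in this sketch.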
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Sparse Evolutionary Training
