ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds

Sreyan Ghosh; Sonal Kumar; Chandra Kiran Reddy Evuru; Oriol Nieto; Ramani Duraiswami; Dinesh Manocha

arXiv:2409.09213·eess.AS·September 17, 2024

TL;DR

This paper introduces ReCLAP, a model that enhances zero-shot audio classification by using descriptive prompts and rewritten captions, significantly improving performance over existing methods.

Contribution

ReCLAP is trained with rewritten captions describing sounds' unique features and employs prompt augmentation, leading to substantial improvements in zero-shot audio classification accuracy.

Findings

01. ReCLAP outperforms all baselines on multi-modal audio-text retrieval.

02. ReCLAP improves zero-shot audio classification by 1%-18%.

03. The method outperforms baselines by 1%-55%.

Abstract

Open-vocabulary audio-language models, like CLAP, offer a promising approach for zero-shot audio classification (ZSAC) by enabling classification with any arbitrary set of categories specified with natural language prompts. In this paper, we propose a simple but effective method to improve ZSAC with CLAP. Specifically, we shift from the conventional method of using prompts with abstract category labels (e.g., "Sound of an organ") to prompts that describe sounds using their inherent descriptive features in a diverse context (e.g., "The organ's deep and resonant tones filled the cathedral."). To achieve this, we first propose ReCLAP, a CLAP model trained with rewritten audio captions for improved understanding of sounds in the wild. These rewritten captions describe each sound event in the original caption using its unique discriminative characteristics. ReCLAP outperforms all baselines on…
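The zero-shot setup the abstract describes can be illustrated with a minimal sketch: each category is represented by one or more text prompts, and the audio clip is assigned to the category whose prompt embedding is closest in cosine similarity. Everything below is an assumption made for illustration, not the authors' implementation: the toy 3-dimensional vectors stand in for the outputs of a CLAP-style audio and text encoder, the function names are hypothetical, and averaging several descriptive prompts per class is only one plausible way to combine them (the paper may aggregate differently).

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def class_embedding(prompt_embeddings):
    """Average the unit-normalized embeddings of several prompts for one class.

    Assumption: simple mean pooling over prompts.
    """
    unit = [normalize(e) for e in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def zero_shot_classify(audio_embedding, prompts_by_class):
    """Return the best-matching label and the per-class similarity scores."""
    scores = {
        label: cosine(audio_embedding, class_embedding(embs))
        for label, embs in prompts_by_class.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors standing in for text-encoder outputs of descriptive prompts,
# e.g. "The organ's deep and resonant tones filled the cathedral."
prompts_by_class = {
    "organ": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "dog bark": [[0.0, 0.9, 0.3], [0.1, 0.8, 0.4]],
}

# Toy vector standing in for the audio-encoder output of one clip.
audio_embedding = [0.85, 0.15, 0.05]

label, scores = zero_shot_classify(audio_embedding, prompts_by_class)
print(label)
```

In a real pipeline the toy vectors would be replaced by embeddings from the model's audio and text encoders; the category set can then be changed at inference time simply by supplying different prompts.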

Code & Models

Repositories

sreyan88/reclap
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training