Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models

Xuenan Xu; Pingyue Zhang; Ming Yan; Ji Zhang; Mengyue Wu

arXiv:2407.14355·cs.SD·July 22, 2024·1 cites

Open Access · 1 repo

TL;DR

This paper introduces a novel zero-shot audio classification method that uses large language models to generate detailed sound attribute descriptions, improving recognition of unseen sound classes.

Contribution

It uses large language models to generate sound attribute descriptions and adds contrastive learning to improve zero-shot audio classification accuracy.
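The page does not say which contrastive objective the paper uses. As a rough illustration only, the sketch below implements a generic symmetric audio-text contrastive loss (InfoNCE style) in plain Python. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for this example and are not taken from the paper or its repository.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumptions (not from the paper): rows are already L2-normalized,
    and audio_embs[i] is the positive match for text_embs[i].
    """
    n = len(audio_embs)
    # Similarity matrix: logits[i][j] = <audio_i, text_j> / temperature
    logits = [
        [sum(a * t for a, t in zip(audio_embs[i], text_embs[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-text direction: each row should peak on its diagonal entry.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


# Toy batch: two audio clips and two text descriptions (hypothetical values).
audio = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = contrastive_loss(audio, text_aligned)
loss_swapped = contrastive_loss(audio, text_swapped)
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairs are mismatched, which is the signal that pulls matching audio and text together during training.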

Findings

01 Significant accuracy improvements on VGGSound and AudioSet

02 Robust performance across different model architectures

03 Effective use of attribute descriptions for unseen classes

Abstract

Zero-shot audio classification aims to recognize and classify a sound class that the model has never seen during training. This paper presents a novel approach for zero-shot audio classification using automatically generated sound attribute descriptions. We propose a list of sound attributes and leverage the domain knowledge of large language models to generate detailed attribute descriptions for each class. In contrast to previous works that primarily relied on class labels or simple descriptions, our method focuses on multi-dimensional innate auditory attributes, capturing different characteristics of sound classes. Additionally, we incorporate a contrastive learning approach to enhance zero-shot learning from textual labels. We validate the effectiveness of our method on VGGSound and AudioSet. (The code is available at https://www.github.com/wsntxxn/AttrEnhZsAc.) Our results…
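To make the idea in the abstract concrete, here is a minimal sketch of zero-shot inference from attribute descriptions: each unseen class is represented by the embeddings of several attribute descriptions, and an audio clip is assigned to the class whose text representation is most similar. This is not the authors' implementation. Averaging the attribute embeddings into one prototype per class, the use of cosine similarity, the function names, and the toy 3-dimensional vectors are all assumptions for illustration; in practice the embeddings would come from trained audio and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def class_prototype(attribute_embs):
    """Average the normalized embeddings of one class's attribute descriptions.

    Averaging is an assumption of this sketch, not a detail from the paper.
    """
    normed = [l2_normalize(e) for e in attribute_embs]
    dim = len(normed[0])
    return [sum(e[i] for e in normed) / len(normed) for i in range(dim)]


def zero_shot_classify(audio_emb, class_attribute_embs):
    """Return the best-matching class name and the score for every class."""
    scores = {
        name: cosine(audio_emb, class_prototype(embs))
        for name, embs in class_attribute_embs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical embeddings of attribute descriptions for two unseen classes.
class_attribute_embs = {
    "dog bark": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "siren": [[0.0, 0.9, 0.3], [0.1, 0.8, 0.4]],
}
# Hypothetical embedding of an audio clip from an audio encoder.
audio_emb = [0.85, 0.15, 0.05]

predicted, scores = zero_shot_classify(audio_emb, class_attribute_embs)
```

Because classification only needs text embeddings for each class, new classes can be added at inference time by supplying their attribute descriptions, with no audio examples of those classes.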

Code & Models

Repositories

wsntxxn/attrenhzsac
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Diverse Musicological Studies · Speech Recognition and Synthesis