Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models
Xuenan Xu, Pingyue Zhang, Ming Yan, Ji Zhang, Mengyue Wu

TL;DR
This paper introduces a novel zero-shot audio classification method that uses large language models to generate detailed sound attribute descriptions, improving recognition of unseen sound classes.
Contribution
The method uses large language models to generate sound attribute descriptions for each class and applies contrastive learning to improve zero-shot audio classification accuracy.
Findings
Significant accuracy improvements on VGGSound and AudioSet
Robust performance across different model architectures
Effective use of attribute descriptions for unseen classes
Abstract
Zero-shot audio classification aims to recognize and classify a sound class that the model has never seen during training. This paper presents a novel approach for zero-shot audio classification using automatically generated sound attribute descriptions. We propose a list of sound attributes and leverage the domain knowledge of large language models to generate detailed attribute descriptions for each class. In contrast to previous works that primarily relied on class labels or simple descriptions, our method focuses on multi-dimensional innate auditory attributes, capturing different characteristics of sound classes. Additionally, we incorporate a contrastive learning approach to enhance zero-shot learning from textual labels. We validate the effectiveness of our method on VGGSound and AudioSet (the code is available at https://www.github.com/wsntxxn/AttrEnhZsAc). Our results…
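The pipeline the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes audio clips and attribute-enriched class descriptions are already embedded in a shared vector space, and it substitutes a generic symmetric InfoNCE loss for the contrastive objective, which the abstract does not specify. The attribute names (`pitch`, `timbre`), the prompt template, the temperature value, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_prompt(label, attributes):
    """Join a class label with its attribute descriptions (hypothetical template).

    `attributes` maps an attribute name to the description an LLM produced for it.
    """
    parts = [f"{name}: {desc}" for name, desc in attributes.items()]
    return f"{label}. " + "; ".join(parts)


def zero_shot_classify(audio_emb, class_text_embs):
    """Pick the class whose text embedding is closest to the audio embedding."""
    scores = {label: cosine(audio_emb, emb) for label, emb in class_text_embs.items()}
    return max(scores, key=scores.get), scores


def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where audio_embs[i] matches text_embs[i].

    This is an assumed, generic formulation, not the loss from the paper.
    """
    n = len(audio_embs)
    logits = [[cosine(a, t) / temperature for t in text_embs] for a in audio_embs]

    def cross_entropy(rows):
        # Mean cross-entropy with the diagonal entry as the target of each row.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / n

    text_to_audio = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(text_to_audio))
```

In this sketch an unseen class needs only a text description: its prompt is embedded by a text encoder and compared against the audio embedding, so no audio examples of that class are required at training time.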
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Speech Recognition and Synthesis
