Zero-Shot Audio Classification Based on Class Label Embeddings
Huang Xie, Tuomas Virtanen

TL;DR
This paper introduces a zero-shot audio classification method using class label embeddings and a bilinear model, enabling recognition of audio classes without training samples and achieving promising accuracy on ESC-50.
Contribution
It presents a zero-shot audio classification system that combines semantic class label embeddings with a bilinear compatibility model, so that audio classes can be recognized without any audio samples from the target classes.
Findings
Achieves 26% average accuracy on ESC-50 for zero-shot classification.
Outperforms random guessing (10%) in every audio category.
Reaches up to 39.7% accuracy for natural audio classes.
Abstract
This paper proposes a zero-shot learning approach for audio classification based on the textual information about class labels without any audio samples from target classes. We propose an audio classification system built on the bilinear model, which takes audio feature embeddings and semantic class label embeddings as input, and measures the compatibility between an audio feature embedding and a class label embedding. We use VGGish to extract audio feature embeddings from audio recordings. We treat textual labels as semantic side information of audio classes, and use Word2Vec to generate class label embeddings. Results on the ESC-50 dataset show that the proposed system can perform zero-shot audio classification with a small training dataset. It achieves accuracy (26 % on average) better than random guessing (10 %) in each audio category. In particular, it reaches up to 39.7 % for the…
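The bilinear compatibility step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the common bilinear form score(x, y) = xᵀ W y, uses random vectors as stand-ins for VGGish audio embeddings and Word2Vec label embeddings, and assumes dimensions of 128 and 300 respectively. The names `compatibility`, `predict`, and `W` are hypothetical, and how `W` is trained is not shown.

```python
import numpy as np


def compatibility(audio_emb, W, label_embs):
    """Bilinear score x^T W y between one audio embedding and every label.

    audio_emb:  shape (d_audio,)            stand-in for a VGGish embedding
    W:          shape (d_audio, d_label)    compatibility matrix (learned in practice)
    label_embs: shape (n_classes, d_label)  stand-ins for Word2Vec embeddings
    Returns an array of shape (n_classes,) with one score per class.
    """
    return label_embs @ (W.T @ audio_emb)


def predict(audio_emb, W, label_embs):
    """Return the index of the class whose label embedding scores highest."""
    return int(np.argmax(compatibility(audio_emb, W, label_embs)))


# Toy data only: the dimensions and random values are assumptions for illustration.
rng = np.random.default_rng(0)
d_audio, d_label, n_classes = 128, 300, 10

W = rng.normal(scale=0.01, size=(d_audio, d_label))  # untrained placeholder
label_embs = rng.normal(size=(n_classes, d_label))   # unseen-class label vectors
audio_emb = rng.normal(size=d_audio)                 # one test clip

scores = compatibility(audio_emb, W, label_embs)
pred = predict(audio_emb, W, label_embs)
```

Because the target classes enter only through `label_embs`, a new class can be scored by supplying its label embedding, with no audio examples of that class needed at training time. That is the property that makes the setup zero-shot.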
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Diverse Musicological Studies
