Embedding-Space Diffusion for Zero-Shot Environmental Sound Classification
Ysobel Sims, Alexandre Mendes, Stephan Chalup

TL;DR
This paper introduces a novel diffusion model for zero-shot environmental sound classification, demonstrating superior performance over existing generative methods across multiple audio datasets.
Contribution
It adapts and evaluates a diffusion model for zero-shot learning in environmental audio, establishing a new benchmark and advancing generative approaches in this domain.
Findings
The diffusion model outperforms the baselines on six audio datasets.
Generative methods improve zero-shot environmental sound classification.
The work provides the first benchmark of generative methods in this field.
Abstract
Zero-shot learning enables models to generalise to unseen classes by leveraging semantic information, bridging the gap between training and testing sets with non-overlapping classes. While much research has focused on zero-shot learning in computer vision, the application of these methods to environmental audio remains underexplored, with poor performance in existing studies. Generative methods, which have demonstrated success in computer vision, are notably absent from zero-shot environmental sound classification studies. To address this gap, this work investigates generative methods for zero-shot learning in environmental audio. Two successful generative models from computer vision are adapted: a cross-aligned and distribution-aligned variational autoencoder (CADA-VAE) and a leveraging invariant side generative adversarial network (LisGAN). Additionally, we introduced a novel…
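The general recipe behind generative zero-shot classification can be illustrated with a small toy sketch. It is not the paper's diffusion model, CADA-VAE, or LisGAN: the "generator" here is a plain least-squares linear map standing in for a learned generative model, and all data, dimensions, and names (run_toy_zero_shot, true_map, and so on) are hypothetical. The sketch only shows the flow: fit a generator on seen classes, synthesise embeddings for unseen classes from their semantic vectors, then classify real unseen samples against the synthetic ones.

```python
import numpy as np


def run_toy_zero_shot(seed=0):
    """Toy generative zero-shot pipeline on synthetic data (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_seen, n_unseen, sem_dim, emb_dim = 20, 3, 5, 16  # assumed toy sizes

    # Class-level semantic vectors (stand-ins for, e.g., label word embeddings).
    semantic = rng.normal(size=(n_seen + n_unseen, sem_dim))
    # Assumed toy ground truth: audio embeddings are a noisy linear
    # function of the class semantics.
    true_map = rng.normal(size=(sem_dim, emb_dim))

    def sample(class_ids, per_class, noise=0.1):
        labels = np.repeat(class_ids, per_class)
        feats = semantic[labels] @ true_map
        feats = feats + noise * rng.normal(size=feats.shape)
        return feats, labels

    seen_ids = np.arange(n_seen)
    unseen_ids = np.arange(n_seen, n_seen + n_unseen)

    # 1) Fit the placeholder generator on seen classes only.
    train_x, train_y = sample(seen_ids, per_class=50)
    w_hat, *_ = np.linalg.lstsq(semantic[train_y], train_x, rcond=None)

    # 2) Synthesise embeddings for unseen classes from semantics alone.
    synth_y = np.repeat(unseen_ids, 50)
    synth_x = semantic[synth_y] @ w_hat
    synth_x = synth_x + 0.1 * rng.normal(size=synth_x.shape)

    # 3) Build a nearest-centroid classifier from the synthetic embeddings.
    centroids = np.stack([synth_x[synth_y == c].mean(axis=0) for c in unseen_ids])

    # 4) Evaluate on real samples from the unseen classes.
    test_x, test_y = sample(unseen_ids, per_class=30)
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    pred = unseen_ids[dists.argmin(axis=1)]
    return float((pred == test_y).mean())


if __name__ == "__main__":
    print("toy unseen-class accuracy:", run_toy_zero_shot())
```

In a real system, step 1 would train a generative model such as a VAE, GAN, or diffusion model conditioned on the semantic vectors, and step 3 would typically train a stronger classifier on the synthetic embeddings. The linear map is used here only to keep the sketch short and runnable.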
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Ultrasonics and Acoustic Wave Propagation
Methods: Diffusion
