# Zero-Shot Audio Classification Based on Class Label Embeddings

**Authors:** Huang Xie, Tuomas Virtanen

arXiv: 1905.01926 · 2019-08-08

## TL;DR

This paper introduces a zero-shot audio classification method that scores VGGish audio embeddings against Word2Vec class label embeddings with a bilinear model, so it can recognize audio classes that have no training samples. On ESC-50 it reaches 26% average accuracy, against 10% for random guessing.

## Contribution

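It presents a zero-shot audio classification system in which a bilinear model measures the compatibility between audio feature embeddings and semantic class label embeddings, so that target classes can be recognized from their textual labels alone, without any audio samples from those classes.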

## Key findings

- Achieves 26% average accuracy on ESC-50 for zero-shot classification.
- Beats random guessing (10%) in every audio category.
- Reaches up to 39.7% accuracy on the category of natural audio classes.

## Abstract

This paper proposes a zero-shot learning approach for audio classification based on the textual information about class labels without any audio samples from target classes. We propose an audio classification system built on the bilinear model, which takes audio feature embeddings and semantic class label embeddings as input, and measures the compatibility between an audio feature embedding and a class label embedding. We use VGGish to extract audio feature embeddings from audio recordings. We treat textual labels as semantic side information of audio classes, and use Word2Vec to generate class label embeddings. Results on the ESC-50 dataset show that the proposed system can perform zero-shot audio classification with a small training dataset. It can achieve accuracy (26% on average) better than random guessing (10%) on each audio category. In particular, it reaches up to 39.7% for the category of natural audio classes.
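The compatibility scoring described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the common bilinear form F(x, y) = xᵀWy, assumes 128-dimensional audio embeddings and 300-dimensional label embeddings (typical sizes for VGGish and Word2Vec, not stated above), and uses random arrays in place of real embeddings and a trained matrix `W`. The names `compatibility` and `predict` are hypothetical, and the training procedure for `W` on seen classes is omitted.

```python
import numpy as np


def compatibility(audio_emb, W, label_embs):
    """Bilinear scores F(x, y_c) = x^T W y_c for every candidate class c.

    audio_emb:  (audio_dim,) audio feature embedding
    W:          (audio_dim, label_dim) compatibility matrix
    label_embs: (num_classes, label_dim) one label embedding per class
    """
    return audio_emb @ W @ label_embs.T


def predict(audio_emb, W, label_embs):
    """Return the index of the class whose label embedding scores highest."""
    return int(np.argmax(compatibility(audio_emb, W, label_embs)))


# Assumed sizes: 128-d audio embedding, 300-d label embedding, and 10 unseen
# candidate classes (consistent with the 10% random-guess baseline).
AUDIO_DIM, LABEL_DIM, NUM_UNSEEN = 128, 300, 10

rng = np.random.default_rng(0)
# Placeholder values only: in a real system W is learned from seen classes,
# and the embeddings come from VGGish and Word2Vec.
W = rng.normal(size=(AUDIO_DIM, LABEL_DIM))
label_embs = rng.normal(size=(NUM_UNSEEN, LABEL_DIM))
audio_emb = rng.normal(size=AUDIO_DIM)

scores = compatibility(audio_emb, W, label_embs)
pred = predict(audio_emb, W, label_embs)
print(scores.shape, pred)
```

Because the unseen classes enter only through their label embeddings, a new class can be added at test time by appending a row to `label_embs`, without retraining `W`.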

---
Source: https://tomesphere.com/paper/1905.01926