Multi-label Zero-Shot Audio Classification with Temporal Attention
Duygu Dogan, Huang Xie, Toni Heittola, Tuomas Virtanen

TL;DR
This paper introduces a multi-label zero-shot audio classification method that uses temporal attention to focus on the audio segments most relevant to each class, improving accuracy over approaches that rely on temporally aggregated acoustic features.
Contribution
The study presents a new approach that applies temporal attention to enhance multi-label zero-shot audio classification, addressing the challenge of classifying multiple unseen sound classes.
Findings
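Temporal attention improves multi-label zero-shot classification accuracy over methods that use temporally aggregated acoustic features.
The method outperforms baseline models on a subset of AudioSet.
The approach extends zero-shot learning from single-label to multi-label audio classification.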
Abstract
Zero-shot learning models are capable of classifying new classes by transferring knowledge from the seen classes using auxiliary information. While most existing zero-shot learning methods have focused on single-label classification tasks, the present study introduces a method to perform multi-label zero-shot audio classification. To address the challenge of classifying multi-label sounds while generalizing to unseen classes, we adapt temporal attention. The temporal attention mechanism assigns importance weights to different audio segments based on their acoustic and semantic compatibility, thus enabling the model to capture the varying dominance of different sound classes within an audio sample by focusing on the segments most relevant for each class. This leads to more accurate multi-label zero-shot classification than methods employing temporally aggregated acoustic features…
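The attention mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that segment-level acoustic embeddings have already been projected into the same space as the class (semantic) embeddings, that compatibility is a plain dot product, that attention weights are a softmax over time, and that a sigmoid turns each pooled score into an independent per-class probability. The function name `temporal_attention_scores` and the toy vectors are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def temporal_attention_scores(segments, class_embeddings):
    """Score each class against an audio clip using per-class temporal attention.

    segments:         T vectors of dimension D, one per audio segment
                      (assumed to be already projected into the semantic space).
    class_embeddings: C vectors of dimension D, one per (possibly unseen) class.

    Returns (probs, weights), where probs[c] is a per-class probability and
    weights[c][t] is the attention that class c places on segment t.
    """
    probs, weights = [], []
    for cls in class_embeddings:
        # Compatibility between every segment and this class (assumed dot product).
        compat = [dot(seg, cls) for seg in segments]
        # Importance weights over time: segments that match the class get more weight.
        attn = softmax(compat)
        # Clip-level score: attention-weighted sum of segment-level compatibilities.
        clip_score = sum(a * s for a, s in zip(attn, compat))
        # Independent sigmoid per class, so several classes can be active at once.
        probs.append(sigmoid(clip_score))
        weights.append(attn)
    return probs, weights


# Toy clip: segment 0 resembles class 0, segment 1 resembles class 1, segment 2 is silence.
segments = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
class_embeddings = [[2.0, 0.0], [0.0, 2.0]]

probs, weights = temporal_attention_scores(segments, class_embeddings)
print("per-class probabilities:", probs)
print("attention over segments:", weights)
```

In this toy clip each class attends mostly to the one segment that matches it, so its pooled score is higher than it would be if the three segment scores were simply averaged. This is the intuition behind preferring attention to temporally aggregated features when different sounds dominate different parts of a clip.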
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Softmax · Attention Is All You Need
