Zero- and Few-shot Sound Event Localization and Detection

Kazuki Shimada; Kengo Uchida; Yuichiro Koyama; Takashi Shibuya; Shusuke Takahashi; Yuki Mitsufuji; Tatsuya Kawahara

arXiv:2309.09223·cs.SD·January 19, 2024

Open Access

TL;DR

This paper introduces the embed-ACCDOA model for zero- and few-shot sound event localization and detection (SELD). It lets target classes be set after training from a text sample or a few audio samples, and achieves results competitive with baselines trained on full data.

Contribution

The paper proposes the embed-ACCDOA model, which combines CLAP embeddings with DOA estimation for zero- and few-shot SELD. It addresses two challenges: customizing target classes after training, and assigning an activity and a DOA to each embedding when events overlap.

Findings

01. embed-ACCDOA achieves better location scores than the baseline combinations.

02. Zero- and few-shot models match the performance of baselines trained on full data.

03. Class adaptation is effective with only a few samples.

Abstract

Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes trained before inference. To customize target classes after training, we tackle zero- and few-shot SELD tasks, in which we set new classes with a text sample or a few audio samples. While zero-shot sound classification tasks are achievable by embedding from contrastive language-audio pretraining (CLAP), zero-shot SELD tasks require assigning an activity and a DOA to each embedding, especially in overlapping cases. To tackle the assignment problem in overlapping cases, we propose an embed-ACCDOA model, which is trained to output track-wise CLAP embedding and corresponding…
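The assignment step described in the abstract can be illustrated with a minimal sketch. It assumes each output track carries an embedding plus a Cartesian ACCDOA vector whose length indicates activity and whose direction gives the DOA. The function name `decode_frame`, both thresholds, the toy 3-dimensional embeddings, and the nearest-class rule by cosine similarity are assumptions made for illustration; they are not taken from the paper, and real CLAP embeddings have far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def decode_frame(tracks, class_embeddings, activity_thr=0.5, similarity_thr=0.5):
    """Turn one frame of track-wise outputs into (class, azimuth_deg, elevation_deg) tuples.

    tracks: list of (embedding, (x, y, z)) pairs, one per output track.
    class_embeddings: dict mapping a user-chosen class name to its reference
        embedding (hypothetically obtained from a text prompt or a few audio clips).
    Both thresholds are illustrative values, not values from the paper.
    """
    detections = []
    for embedding, (x, y, z) in tracks:
        # The vector length acts as the activity score for this track.
        norm = math.sqrt(x * x + y * y + z * z)
        if norm < activity_thr:
            continue  # track is treated as inactive in this frame

        # Pick the most similar target class, if any clears the threshold.
        best_name, best_sim = None, similarity_thr
        for name, reference in class_embeddings.items():
            sim = cosine(embedding, reference)
            if sim >= best_sim:
                best_name, best_sim = name, sim
        if best_name is None:
            continue  # active, but not one of the requested classes

        # The vector direction gives the DOA; convert to azimuth and elevation.
        azimuth = math.degrees(math.atan2(y, x))
        elevation = math.degrees(math.asin(max(-1.0, min(1.0, z / norm))))
        detections.append((best_name, azimuth, elevation))
    return detections


# Toy example: two requested classes, three output tracks.
class_embeddings = {"dog bark": [1.0, 0.0, 0.0], "door knock": [0.0, 1.0, 0.0]}
tracks = [
    ([0.9, 0.1, 0.0], (0.0, 0.8, 0.0)),  # active, close to "dog bark"
    ([0.1, 0.9, 0.0], (0.1, 0.0, 0.0)),  # too short a vector: inactive
    ([0.0, 0.0, 1.0], (0.7, 0.0, 0.0)),  # active, but matches no requested class
]
detections = decode_frame(tracks, class_embeddings)
```

Because each track keeps its own embedding and its own DOA vector, two overlapping events can be reported with different classes and different directions in the same frame, which is the assignment problem the abstract highlights.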

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis