Unified Audio Event Detection
Yidi Jiang, Ruijie Tao, Wen Huang, Qian Chen, Wen Wang

TL;DR
This paper introduces Unified Audio Event Detection (UAED), a task that jointly detects non-speech sound events and speaker-attributed speech events, together with a Transformer-based framework (T-UAED) that exploits the synergy between the two and outperforms a baseline combining separate SED and SD outputs.
Contribution
The paper proposes a novel UAED task and a Transformer-based T-UAED framework that jointly models speech and non-speech sounds, improving over baseline methods.
Findings
T-UAED outperforms a baseline that combines the outputs of separate SED and SD models.
T-UAED performs comparably to specialized models for individual tasks.
The framework effectively exploits task interactions.
Abstract
Sound Event Detection (SED) detects regions of sound events, while Speaker Diarization (SD) segments speech conversations attributed to individual speakers. In SED, all speaker segments are classified as a single speech event, while in SD, non-speech sounds are treated merely as background noise. Thus, both tasks provide only partial analysis in complex audio scenarios involving both speech conversation and non-speech sounds. In this paper, we introduce a novel task called Unified Audio Event Detection (UAED) for comprehensive audio analysis. UAED explores the synergy between SED and SD tasks, simultaneously detecting non-speech sound events and fine-grained speech events based on speaker identities. To tackle this task, we propose a Transformer-based UAED (T-UAED) framework and construct the UAED Data derived from the Librispeech dataset and DESED soundbank. Experiments demonstrate…
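To make the unified label space concrete, here is a minimal sketch of what a UAED-style output could look like when built by simply merging separate SED and SD outputs, in the spirit of the combination baseline mentioned in the findings. It is an illustration only, not the authors' T-UAED method or their baseline code: the `Event` type, the `combine_sed_sd` function, the label strings, and the timestamps are all hypothetical.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    """A detected region: onset/offset in seconds plus a class or speaker label."""
    onset: float
    offset: float
    label: str


def combine_sed_sd(sed_events: List[Event],
                   sd_segments: List[Event],
                   speech_label: str = "Speech") -> List[Event]:
    """Merge SED and SD outputs into one event list (hypothetical helper).

    SED reports all talkers as one coarse speech class, so that class is
    dropped and replaced by the per-speaker segments that SD provides.
    """
    non_speech = [e for e in sed_events if e.label != speech_label]
    unified = non_speech + list(sd_segments)
    # Sort by time so speech and non-speech events interleave chronologically.
    return sorted(unified, key=lambda e: (e.onset, e.offset, e.label))


# Made-up example: two speakers talking while a dog barks, then an alarm.
sed = [
    Event(0.0, 4.0, "Speech"),
    Event(1.5, 2.5, "Dog"),
    Event(5.0, 6.0, "Alarm"),
]
sd = [
    Event(0.0, 2.0, "speaker_1"),
    Event(2.0, 4.0, "speaker_2"),
]

unified = combine_sed_sd(sed, sd)
for event in unified:
    print(f"{event.onset:4.1f} - {event.offset:4.1f}  {event.label}")
```

A merge like this treats the two tasks as independent; the paper's premise is that modeling them jointly lets each task inform the other, which a post-hoc merge cannot do.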
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis
