Robust Speech Activity Detection in the Presence of Singing Voice
Philipp Grundhuber, Mhd Modar Halimeh, Martin Strauß, Emanuël A. P. Habets

TL;DR
This paper presents SR-SAD, a neural network that detects speech reliably when singing is present. It relies on a training strategy with controlled ratios of speech and singing samples, and it is assessed with a new evaluation metric for mixed speech-singing audio.
Contribution
Introduction of SR-SAD, a neural network with a novel training approach and evaluation metric for robust speech detection amidst singing.
Findings
Achieves high speech detection accuracy (AUC = 0.919) on a challenging mixed speech-singing dataset while rejecting singing.
Maintains robust performance while reducing inference runtime.
Effectively distinguishes speech from singing across multiple musical genres.
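For readers unfamiliar with the AUC figure quoted above, the following is a minimal sketch of a plain ROC AUC computed over frame-level scores. It is not the paper's new evaluation metric, which this summary does not describe. The function name `roc_auc`, the toy scores, and the choice to label singing frames as non-speech (0) are assumptions made for illustration.

```python
def roc_auc(scores, labels):
    """Rank-based ROC AUC: the probability that a randomly chosen speech
    frame scores higher than a randomly chosen non-speech frame.
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one speech and one non-speech frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical detector outputs for six frames.
# Assumption: singing frames are labelled 0 (non-speech), so a detector
# that fires on singing is penalised.
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
```

In this toy case one non-speech frame (score 0.7) outranks one speech frame (score 0.6), so the AUC falls below 1.0. A singing segment that the detector mistakes for speech would lower the score in the same way.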
Abstract
Speech Activity Detection (SAD) systems often misclassify singing as speech, leading to degraded performance in applications such as dialogue enhancement and automatic speech recognition. We introduce Singing-Robust Speech Activity Detection (SR-SAD), a neural network designed to robustly detect speech in the presence of singing. Our key contributions are: i) a training strategy using controlled ratios of speech and singing samples to improve discrimination, ii) a computationally efficient model that maintains robust performance while reducing inference runtime, and iii) a new evaluation metric tailored to assess SAD robustness in mixed speech-singing scenarios. Experiments on a challenging dataset spanning multiple musical genres show that SR-SAD maintains high speech detection accuracy (AUC = 0.919) while rejecting singing. By explicitly learning to distinguish between speech and…
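One way to picture the "controlled ratios of speech and singing samples" idea is a batch sampler that fixes the share of each class per batch. The sketch below is only an illustration under stated assumptions: the function `sample_batch`, the three-pool split (speech, singing, other), the fractions 0.5 and 0.25, and the clip names are all hypothetical, and the paper's actual ratios and data pipeline are not given in this summary.

```python
import random


def sample_batch(speech, singing, other, batch_size,
                 speech_frac, singing_frac, rng):
    """Draw a batch with a fixed share of speech and singing clips.

    Returns (clip, label) pairs, where label 1 means speech and label 0
    means non-speech. Assumption: singing is treated as non-speech.
    """
    n_speech = round(batch_size * speech_frac)
    n_singing = round(batch_size * singing_frac)
    n_other = batch_size - n_speech - n_singing
    if n_other < 0:
        raise ValueError("speech_frac + singing_frac must not exceed 1")
    # Sample with replacement so small pools can still fill their quota.
    batch = (
        [(clip, 1) for clip in rng.choices(speech, k=n_speech)]
        + [(clip, 0) for clip in rng.choices(singing, k=n_singing)]
        + [(clip, 0) for clip in rng.choices(other, k=n_other)]
    )
    rng.shuffle(batch)
    return batch


# Hypothetical clip identifiers standing in for audio files.
speech_pool = ["speech_a", "speech_b", "speech_c"]
singing_pool = ["sing_a", "sing_b"]
other_pool = ["music_a", "noise_a"]

rng = random.Random(0)
batch = sample_batch(speech_pool, singing_pool, other_pool,
                     batch_size=8, speech_frac=0.5, singing_frac=0.25,
                     rng=rng)
```

Fixing the per-batch share, rather than sampling from the pooled data, keeps singing examples present in every update even when the singing pool is much smaller than the speech pool.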
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders
