Improvement of Noise-Robust Single-Channel Voice Activity Detection with Spatial Pre-processing

Max Væhrens, Andreas Jonas Fuglsig, Anders Post Jacobsen, Nicolai Almskou Rasmussen, Victor Mølbach Nissen, Joachim Roland Hejslet and Zheng-Hua Tan

arXiv:2104.05481 · eess.AS · April 13, 2021

Open Access

TL;DR

This paper enhances single-channel voice activity detection (VAD) in noisy environments by applying spatial pre-processing techniques, such as beamforming and spatial detection, leading to significant improvements over traditional methods and even multi-channel VAD in challenging conditions.

Contribution

The study introduces novel spatial pre-processing methods to improve single-channel VAD, demonstrating superior noise robustness compared to existing approaches.

Findings

01. Spatial detector significantly improves VAD accuracy.

02. Pre-processing methods outperform baseline MVAD in noisy conditions.

03. SVAD with spatial pre-processing is effective across various noise types.

Abstract

Voice activity detection (VAD) remains a challenge in noisy environments. With access to multiple microphones, prior studies have attempted to improve the noise robustness of VAD by creating multi-channel VAD (MVAD) methods. However, MVAD is relatively new compared to single-channel VAD (SVAD), which has been thoroughly developed in the past. It might therefore be advantageous to improve SVAD methods with pre-processing to obtain superior VAD, which is under-explored. This paper improves SVAD through two pre-processing methods, a beamformer and a spatial target speaker detector. The spatial detector sets signal frames to zero when no potential speaker is present within a target direction. The detector may be implemented as a filter, meaning the input signal for the SVAD is filtered according to the detector's output; or it may be implemented as a spatial VAD to be combined with the SVAD…
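
The two ways of using the spatial detector described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the energy-threshold stand-in for the SVAD, the frame length, and the hard-coded detector output are all hypothetical, and the logical AND used to combine the detector with the SVAD is an assumption, since the abstract is cut off before it specifies the combination rule.

```python
import numpy as np


def apply_detector_as_filter(frames, target_present):
    """Filter mode: zero every frame in which the spatial detector
    reports no potential speaker in the target direction."""
    frames = np.asarray(frames, dtype=float)
    mask = np.asarray(target_present, dtype=bool)
    # Broadcast the per-frame mask over the samples of each frame.
    return frames * mask[:, None]


def toy_energy_svad(frames, threshold=0.01):
    """Hypothetical stand-in for a real SVAD: flag a frame as speech
    when its mean power exceeds a fixed threshold."""
    frames = np.asarray(frames, dtype=float)
    return np.mean(frames ** 2, axis=1) > threshold


def combine_with_svad(svad_decisions, target_present):
    """Spatial-VAD mode (assumed rule): a frame counts as speech only if
    both the SVAD and the spatial detector say so."""
    return np.logical_and(svad_decisions, target_present)


# Four frames of 160 samples each (synthetic noise used as placeholder audio).
rng = np.random.default_rng(0)
frames = rng.normal(scale=1.0, size=(4, 160))

# Assumed detector output: target-direction speaker in frames 0 and 2 only.
target_present = np.array([True, False, True, False])

# Filter mode: the SVAD only ever sees the gated signal.
filtered = apply_detector_as_filter(frames, target_present)
decisions_filter = toy_energy_svad(filtered)

# Spatial-VAD mode: the SVAD runs on the raw signal, then decisions are merged.
decisions_combined = combine_with_svad(toy_energy_svad(frames), target_present)
```

In this toy setup both modes agree, because zeroed frames carry no energy. With a real SVAD the two can differ: in filter mode the SVAD's internal state (for example, a noise estimate) is affected by the zeroed frames, whereas in the combined mode the SVAD runs on the unmodified signal.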

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing