Window Size Versus Accuracy Experiments in Voice Activity Detectors
Max McKinnon, Samir Khaki, Chandan KA Reddy, William Huang

TL;DR
This study evaluates how window size affects the accuracy of three voice activity detection algorithms across real-world audio streams, providing insights for optimizing VAD system performance.
Contribution
The paper offers a comparative analysis of VAD algorithms at different window sizes and explores the effect of hysteresis on their outputs, a novel practical evaluation.
Findings
Silero outperforms WebRTC and RMS in accuracy.
Hysteresis improves WebRTC's performance.
Optimal window size varies across algorithms.
Abstract
Voice activity detection (VAD) plays a vital role in enabling applications such as speech recognition. We analyze the impact of window size on the accuracy of three VAD algorithms: Silero, WebRTC, and Root Mean Square (RMS) across a set of diverse real-world digital audio streams. We additionally explore the use of hysteresis on top of each VAD output. Our results offer practical references for optimizing VAD systems. Silero significantly outperforms WebRTC and RMS, and hysteresis provides a benefit for WebRTC.
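To make the two ideas in the abstract concrete, the following is a minimal sketch of an RMS-based detector evaluated over fixed-size windows, followed by a simple hysteresis pass over its decisions. It is not the authors' implementation: the function names, the non-overlapping windowing, the fixed energy threshold, and the "N consecutive windows before switching state" form of hysteresis are all assumptions made for illustration, since the summary does not specify them.

```python
import math


def rms_vad(samples, window_size, threshold):
    """Return one speech/non-speech decision per non-overlapping window.

    Assumption: a window counts as speech when its RMS energy is at or
    above a fixed threshold. A trailing partial window is dropped.
    """
    decisions = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        rms = math.sqrt(sum(x * x for x in window) / window_size)
        decisions.append(rms >= threshold)
    return decisions


def apply_hysteresis(decisions, on_count, off_count):
    """Smooth raw decisions so the state flips only after a sustained change.

    Assumption: the state switches to speech after `on_count` consecutive
    speech windows and back to non-speech after `off_count` consecutive
    non-speech windows. The initial state is non-speech.
    """
    state = False
    run = 0  # consecutive windows that disagree with the current state
    smoothed = []
    for decision in decisions:
        if decision != state:
            run += 1
            needed = on_count if decision else off_count
            if run >= needed:
                state = decision
                run = 0
        else:
            run = 0
        smoothed.append(state)
    return smoothed


# Hypothetical toy signal: silence, a burst, a short dropout, a burst, silence.
signal = [0.0] * 4 + [0.5] * 4 + [0.0] * 2 + [0.5] * 2 + [0.0] * 4
raw = rms_vad(signal, window_size=2, threshold=0.1)
smooth = apply_hysteresis(raw, on_count=2, off_count=2)
```

On this toy signal the raw decisions flip off for the one-window dropout, while the smoothed decisions stay in the speech state through it, at the cost of a one-window delay at onset and offset. Changing `window_size` changes how many samples each decision summarizes, which is the variable the study sweeps.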
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
