A Real-Time Voice Activity Detection Based On Lightweight Neural

Jidong Jia; Pei Zhao; Di Wang

arXiv:2405.16797 · cs.SD · May 28, 2024

Open Access

TL;DR

This paper introduces MagicNet, a lightweight, real-time neural network for voice activity detection that operates efficiently without future context, achieving improved robustness with fewer parameters.

Contribution

The paper presents MagicNet, a novel lightweight neural network architecture for VAD that emphasizes operational efficiency and real-time performance without using future context.

Findings

01. MagicNet outperforms state-of-the-art algorithms in accuracy and robustness.

02. MagicNet has fewer parameters and lower latency.

03. MagicNet maintains high performance across diverse noise conditions.

Abstract

Voice activity detection (VAD) is the task of detecting speech in an audio stream, which is challenging because real environments contain numerous unseen noises and low signal-to-noise ratios. Recently, neural network-based VADs have alleviated this performance degradation to some extent. However, most existing studies employ excessively large models and incorporate future context, while neglecting to evaluate the operational efficiency and latency of the models. In this paper, we propose a lightweight, real-time neural network called MagicNet, which uses causal, depthwise separable 1-D convolutions and a GRU. Without relying on future features as input, our proposed model is compared with two state-of-the-art algorithms on synthesized in-domain and out-of-domain test datasets. The evaluation results demonstrate that MagicNet can achieve improved performance and…
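
To make the terms in the abstract concrete, here is a minimal pure-Python sketch of a causal, depthwise separable 1-D convolution. It is an illustration of the general technique only, not MagicNet's actual layer: the function names, kernel sizes, weights, and the absence of biases and nonlinearities are all assumptions made for the example. "Causal" means the output at frame t is computed only from frames up to t, which is what allows a model to run without future context.

```python
from typing import List

Frame = List[float]  # one feature vector per time step (length = channels)


def causal_depthwise_conv1d(x: List[Frame], kernels: List[List[float]]) -> List[Frame]:
    """Depthwise stage: channel c is filtered only by its own kernel kernels[c].

    The input is implicitly left-padded with (k - 1) zero frames, so
    output[t] depends on x[t - k + 1 .. t] and never on a future frame.
    """
    k = len(kernels[0])
    out: List[Frame] = []
    for t in range(len(x)):
        frame = []
        for c, kernel in enumerate(kernels):
            acc = 0.0
            for j in range(k):
                src = t - (k - 1) + j  # src <= t always holds
                if src >= 0:           # frames before the start count as zeros
                    acc += kernel[j] * x[src][c]
            frame.append(acc)
        out.append(frame)
    return out


def pointwise_conv1d(x: List[Frame], weights: List[List[float]]) -> List[Frame]:
    """Pointwise (kernel size 1) stage: mixes channels inside each frame.

    weights[o][c] is the weight from input channel c to output channel o.
    """
    return [[sum(w * frame[c] for c, w in enumerate(row)) for row in weights]
            for frame in x]


def causal_depthwise_separable_conv1d(
    x: List[Frame], kernels: List[List[float]], weights: List[List[float]]
) -> List[Frame]:
    """Depthwise filtering over time followed by pointwise channel mixing."""
    return pointwise_conv1d(causal_depthwise_conv1d(x, kernels), weights)


def param_counts(c_in: int, c_out: int, k: int) -> tuple:
    """Weight counts (biases ignored): (standard conv, depthwise separable conv)."""
    return c_in * c_out * k, c_in * k + c_in * c_out


# Hypothetical toy input: 3 frames, 2 channels.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
kernels = [[1.0, 1.0], [0.5, 0.5]]   # one length-2 kernel per channel
weights = [[1.0, 1.0]]               # 2 input channels -> 1 output channel
result = causal_depthwise_separable_conv1d(features, kernels, weights)
```

Because each output frame needs only the current frame and the previous k - 1 frames, a streaming implementation can keep a small buffer of recent frames instead of waiting for lookahead. The `param_counts` helper shows the general size advantage of the factorized form: for 64 input channels, 64 output channels, and a kernel of length 3 (numbers chosen arbitrarily), a standard convolution has 12,288 weights while the depthwise separable version has 4,288.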

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hand Gesture Recognition Systems

Methods: Gated Recurrent Unit