Voice Activity Detection Scheme by Combining DNN Model with GMM Model

Lu Ma; Xiaomeng Zhang; Pei Zhao; Tengrong Su

arXiv:2005.08184 · cs.SD · May 19, 2020

TL;DR

This paper proposes a novel voice activity detection scheme that combines deep neural networks with Gaussian mixture models to enhance adaptability and performance in practical environments, especially under limited data and hardware constraints.

Contribution

It introduces a deeply integrated scheme that combines DNN and GMM models, with a feedback mechanism for real-time adaptation and a control scheme for speech endpoint detection.

Findings

01. Improved VAD accuracy in practical scenarios.

02. Enhanced model adaptability with limited training data.

03. Effective real-time model updates using combined feedback.

Abstract

Deep neural networks (DNNs) are widely used in voice activity detection (VAD) because of their superior modeling ability. However, performance may degrade when sufficient training data, especially data from practical environments, is unavailable, which leads to poor adaptation to the environment. Moreover, large model structures cannot always be used in practice, especially on low-cost devices with restricted hardware. The Gaussian mixture model (GMM) has the opposite profile: its parameters can be updated in real time, but its modeling ability is low. In this paper, a deeply integrated scheme combining the two models is proposed to improve both adaptability and modeling ability. This is done by directly combining the results of the two models and feeding the combined result back, together with the result of the DNN model, to update the GMM model. Besides, a control scheme is elaborately designed…
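
The feedback loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the combination rule or the update equations, so everything here is an assumption made for illustration. The names `OnlineTwoClassGMM` and `combined_vad_step` are hypothetical, the "GMM" is reduced to one 1-D Gaussian per class over a log-energy feature in dB, the two scores are combined by a weighted average, and the GMM is updated only when the combined decision agrees with the DNN decision. The DNN is represented only by a posterior value passed in by the caller.

```python
import math


class OnlineTwoClassGMM:
    """Hypothetical stand-in for an adaptive GMM: one 1-D Gaussian per class."""

    def __init__(self, noise_mean=-60.0, speech_mean=-20.0, var=100.0, rate=0.05):
        self.mean = {"noise": noise_mean, "speech": speech_mean}
        self.var = {"noise": var, "speech": var}
        self.rate = rate  # base forgetting factor (assumed value)

    def _log_pdf(self, x, label):
        m, v = self.mean[label], self.var[label]
        return -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)

    def speech_posterior(self, x):
        # Posterior of "speech" under equal class priors (an assumption).
        d = self._log_pdf(x, "noise") - self._log_pdf(x, "speech")
        d = max(-50.0, min(50.0, d))  # clamp to keep exp() in range
        return 1.0 / (1.0 + math.exp(d))

    def update(self, x, label, weight=1.0):
        # Exponentially weighted update of the selected class only.
        a = self.rate * weight
        delta = x - self.mean[label]
        self.mean[label] += a * delta
        new_var = (1.0 - a) * (self.var[label] + a * delta * delta)
        self.var[label] = max(new_var, 1.0)  # variance floor (assumed)


def combined_vad_step(gmm, feature, dnn_posterior, dnn_weight=0.5, threshold=0.5):
    """Combine both scores, decide, and feed the decision back into the GMM."""
    p_gmm = gmm.speech_posterior(feature)
    p = dnn_weight * dnn_posterior + (1.0 - dnn_weight) * p_gmm
    is_speech = p >= threshold
    dnn_says_speech = dnn_posterior >= threshold
    # Assumed gating rule: adapt only when the combined and DNN decisions agree.
    if is_speech == dnn_says_speech:
        confidence = abs(p - 0.5) * 2.0
        gmm.update(feature, "speech" if is_speech else "noise", weight=confidence)
    return is_speech, p


# Usage: the noise model adapts to a louder noise floor than it started with.
gmm = OnlineTwoClassGMM()
for _ in range(200):
    combined_vad_step(gmm, feature=-50.0, dnn_posterior=0.1)
decision, score = combined_vad_step(gmm, feature=-20.0, dnn_posterior=0.9)
```

In this sketch the noise Gaussian drifts from its initial mean toward the observed noise level while the speech Gaussian is left untouched, which is the kind of real-time adaptation the abstract attributes to the GMM side. Gating the update on agreement with the DNN is one plausible reading of "together with the result of the DNN model"; the paper itself may use a different rule.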

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition