An Ensemble SVM-based Approach for Voice Activity Detection
Jayanta Dey, Md Sanzid Bin Hossain, Mohammad Ariful Haque

TL;DR
This paper introduces an ensemble SVM approach for voice activity detection that achieves high accuracy comparable to neural networks while maintaining low complexity, suitable for speech processing applications.
Contribution
The paper proposes a novel ensemble SVM training method for large datasets in VAD, improving accuracy and efficiency over traditional SVMs.
Findings
Ensemble SVM achieves 88.74% accuracy on VAD.
Compared to stand-alone SVM (57.05%), ensemble SVM significantly improves performance.
Ensemble SVM's accuracy is comparable to neural networks (86.28%).
Table 1: Layer architecture and training settings of the neural network.
| Layer | Number of Nodes | Activation | Epochs | Batch Size | Validation Split |
|---|---|---|---|---|---|
| Hidden Layer 1 | 12 | relu | 100 | 100 | 0.2 |
| Hidden Layer 2 | 8 | relu | 100 | 100 | 0.2 |
| Output Layer | 1 | sigmoid | 100 | 100 | 0.2 |
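Table 2: Classification accuracy for an increasing number of ensemble members.
| Number of Ensemble Members | Ensemble Accuracy | Average Individual Member Accuracy |
|---|---|---|
| 1 | 57.05% | 57.05% |
| 2 | 72.25% | 57.05% |
| 3 | 78.32% | 57.05% |
| 4 | 87.02% | 57.05% |
| 5 | 88.74% | 57.05% |
| 6 | 88.82% | 57.05% |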
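Table 3: Accuracy of individual ensemble members and of the full ensemble on two test files.
| File Name | Ensemble Member Accuracy | Total Ensemble Accuracy |
|---|---|---|
| speech-librivox-0011 | 8.6466% | 90.2256% |
| speech-librivox-0011 | 8.6466% | 90.2256% |
| speech-librivox-0011 | 91.3534% | 90.2256% |
| speech-librivox-0011 | 39.4737% | 90.2256% |
| speech-librivox-0011 | 18.797% | 90.2256% |
| noise-sound-bible-0031 | 5.26316% | 72.45% |
| noise-sound-bible-0031 | 30.8271% | 72.45% |
| noise-sound-bible-0031 | 21.4286% | 72.45% |
| noise-sound-bible-0031 | 49.6241% | 72.45% |
| noise-sound-bible-0031 | 68.797% | 72.45% |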
Abstract
Voice activity detection (VAD), used as the front end of speech enhancement, speech recognition and speaker recognition algorithms, determines the overall accuracy and efficiency of those algorithms. Therefore, a VAD with low complexity and high accuracy is highly desirable for speech processing applications. In this paper, we propose a novel method for training a supervised learning-based VAD system on a large dataset using support vector machines (SVM). Despite the high classification accuracy of SVMs, a stand-alone SVM is not suitable for classifying the large datasets needed for a good VAD system because of its high training complexity. To overcome this problem, a novel ensemble-based approach using SVM is proposed in this paper. The performance of the proposed ensemble structure has been compared with that of a feedforward neural network (NN). Although the NN performs better than a single SVM-based VAD trained on a small portion of the training data, the ensemble SVM gives accuracy comparable to the NN-based VAD. The ensemble SVM and the NN give 88.74% and 86.28% accuracy respectively, whereas the stand-alone SVM shows 57.05% accuracy on average on the test dataset.
Index Terms: Voice activity detection, support vector machine, neural network, ensemble.
1 Introduction
Voice activity detection is the task of separating the speech and non-speech portions of an audio recording. As a typical speech signal may contain silence, monotonic noise and even music frames, sorting out the speech frames with an efficient VAD in the front end is of great importance for the reliable operation of speech processing algorithms.
A number of algorithms for voice activity detection have been proposed in the literature. Among them, methods based on frame energy, a periodicity measure [1] or entropy [2] are attractive for their simplicity and time-efficiency. However, they rely on parameters that are tuned for a particular situation and cannot accurately separate more difficult non-speech frames such as music. This problem can be addressed by adopting relatively complex statistical approaches such as statistical hypothesis testing [3], the long-term spectral divergence measure [4], amplitude probability distribution [5] and low-variance spectrum estimation [6]. However, these methods need to estimate the background noise level and also require several parameters to be tuned. More recent studies attempt to solve the VAD problem from a machine learning point of view [7], [8], [9], classifying each audio frame as speech or non-speech. The main problem with these approaches is that they need to be trained on a large dataset that includes a rich set of non-speech instances to achieve satisfactory performance. This poses a problem for learners such as SVMs, which have high classification accuracy and yet cannot be trained on a large dataset because of the complexity of their training algorithm.
In this paper, we propose a novel ensemble technique to train SVM learners on a large dataset for voice activity detection and compare its performance with that of a neural network-based classifier trained on the same dataset. Time efficiency is a crucial factor for VAD implementation and, unlike other methods reported in the literature [10] that use a large set of features, we focus on building a classifier based on MFCC features only. MFCC features are widely used in speech processing algorithms, so the extracted features can be reused in the subsequent stages, making the overall procedure more efficient. In our work, we train a number of SVMs on small non-overlapping datasets that together cover the whole large dataset, and their predicted probabilities are used as features for the output-layer SVM that gives the final decision. The proposed SVM ensemble gives accuracy approximately similar to that of the state-of-the-art neural network [10]. This approach shows a significant improvement in accuracy over the stand-alone SVM, and the variance in the results of a single SVM is smoothed out by the ensemble.
The paper is organized as follows. Section 2 describes the problem formulation and the data. The system architecture is discussed in Section 3 and the system efficacy is established in Section 4. Finally, Section 5 concludes the paper with a summary of the contributions and our future work.
2 Problem Formulation and Data Description
In our work, we have used the MUSAN corpus [11] as both our training and testing database. The corpus consists of approximately hours of speech, music and noise data, which makes it an ideal database for a supervised learner-based VAD application. The speech dataset contains silent frames, which are excluded from the training dataset by the log-mel energy-based thresholding described later in the paper. A test dataset of about hours has been separated from the corpus, in which the silent frames of the speech recordings were annotated by a human listener. Both the training and testing datasets were divided into frames of duration ms with ms overlap. The challenge, therefore, is to build a learner trained on the features extracted from the training frames that can classify the testing frames as speech or non-speech. Here we have used an ensemble-based approach for training an SVM learner and compared it with a neural network of simple architecture.
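The framing step described above can be illustrated with a minimal sketch. The function name `split_into_frames` and the frame and hop lengths below are hypothetical choices for illustration only and are not taken from the paper; the overlap between consecutive frames equals `frame_len - hop_len`.

```python
def split_into_frames(samples, frame_len, hop_len):
    """Split a 1-D sequence into overlapping frames.

    A trailing partial frame (shorter than frame_len) is dropped.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# Hypothetical toy values: 10 samples, frames of 4 samples, hop of 2 samples,
# so consecutive frames share 2 samples.
signal = list(range(10))
frames = split_into_frames(signal, frame_len=4, hop_len=2)
print(len(frames), frames[1])
```

In practice the lengths would be given in samples, that is, the frame duration and hop multiplied by the sampling rate.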
3 VAD System Description
In this work, we aim to develop a binary classification system in which one class consists of only speech and the other contains silence, music and noise. A general overview of the VAD system is shown in Fig. 1. As silent frames can be separated quite efficiently by energy-based thresholding alone, they were excluded from the training data and detected in the testing phase by thresholding the log-mel energy, which is the first MFCC feature. The remaining frames were then classified using a supervised learner-based VAD system.
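The silence pre-filter can be sketched as follows. This is a simplified illustration, not the paper's implementation: it thresholds the plain log frame energy rather than the log-mel energy (the first MFCC feature) used in the paper, and the threshold value and helper names are hypothetical.

```python
import math


def log_energy(frame, floor=1e-10):
    """Natural log of the mean squared amplitude, floored to avoid log(0)."""
    energy = sum(x * x for x in frame) / len(frame)
    return math.log(max(energy, floor))


def split_silent(frames, threshold):
    """Return (silent_indices, active_indices) for a list of frames."""
    silent, active = [], []
    for idx, frame in enumerate(frames):
        (silent if log_energy(frame) < threshold else active).append(idx)
    return silent, active


# Toy frames: digital silence, a loud frame, and a very quiet frame.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, -0.4, 0.6, -0.5],
    [0.001, -0.001, 0.001, 0.0],
]
silent, active = split_silent(frames, threshold=-8.0)  # hypothetical threshold
print("silent:", silent, "passed to the classifier:", active)
```

Only the frames listed in `active` would be passed on to the supervised classifier.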
3.1 Feature Extraction
To train the classifier, we have used MFCC features, as they are the standard features used in speech processing. Many software packages extract MFCC features efficiently, which makes them a convenient feature set for VAD, where time-efficiency is highly desirable. MFCC is a spectral feature inspired by the human auditory model. As the human ear is efficient at distinguishing between speech and non-speech, we expect the MFCC features to be effective for our purpose as well. Here we have used MFCC features for a particular audio frame.
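As background on the auditory motivation, the sketch below shows one commonly used mel-scale mapping and how equally spaced points on that scale translate into filter centre frequencies in Hz. The paper does not state which mel formula, number of filters or frequency range its MFCC extractor uses, so the formula variant and all values here are assumptions for illustration.

```python
import math


def hz_to_mel(freq_hz):
    """A widely used mel-scale approximation (assumed, not from the paper)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    """Centre frequencies (Hz) of filters equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


# Hypothetical configuration: 10 filters between 0 Hz and 8000 Hz.
centers = mel_center_frequencies(10, 0.0, 8000.0)
print([round(c) for c in centers])
```

The centres are densely packed at low frequencies and spread out at high frequencies, mirroring the frequency resolution of human hearing.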
3.2 Classification using Neural Network
We have developed and evaluated our deep neural network (NN) model with the Python library Keras® and the numerical computation library TensorFlow. We have trained a fully connected neural network model with an input layer of input variables, two hidden layers with 12 and 8 neurons respectively and an output layer with one neuron. We have initialized the network weights to small random numbers drawn from a uniform distribution. Details of the layer architecture are given in Table 1. We have used binary cross-entropy as the loss function and the gradient descent-based algorithm ‘Adam’ as the optimizer.
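To make the architecture concrete, here is a dependency-free sketch of the forward pass and loss for a network with hidden layers of 12 and 8 ReLU units and a single sigmoid output, as listed in Table 1. It is not the authors' Keras model: the input size `N_INPUTS`, the initialization range `scale` and the zero biases are assumptions, and training with Adam is omitted.

```python
import math
import random


def init_layer(n_in, n_out, rng, scale=0.05):
    """Weights drawn from a small uniform range (scale is assumed); zero biases."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    # Numerically stable for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dense(inputs, layer, activation):
    weights, biases = layer
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(features, layers):
    """Probability that the frame is speech."""
    h1 = dense(features, layers[0], relu)
    h2 = dense(h1, layers[1], relu)
    return dense(h2, layers[2], sigmoid)[0]


def binary_cross_entropy(y_true, p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


N_INPUTS = 13  # hypothetical feature count; the paper's value is not given here
rng = random.Random(0)
layers = [init_layer(N_INPUTS, 12, rng), init_layer(12, 8, rng), init_layer(8, 1, rng)]
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)

p = forward([0.0] * N_INPUTS, layers)
print(n_params, p, binary_cross_entropy(1, p))
```

With these sizes the model has only a few hundred trainable parameters, which is consistent with the paper's emphasis on low complexity.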
3.3 Classification using SVM
In our work, we have used the libsvm toolbox [12] for SVM-based classification. As a stand-alone SVM is not suitable for the large training dataset described earlier, we shuffle the feature sets obtained from the training dataset and divide them into smaller non-overlapping datasets. An overview of the procedure is shown in Fig. 2. Here we have used an ensemble of SVMs with the non-linear RBF kernel. The values of the gamma and C parameters have been tuned from the -fold cross-validation accuracy using grid search [13]. The rationale behind choosing an ensemble of learners is that the performance saturates for a higher number of learners. In the first stage, the feature dataset was shuffled and divided into portions. The first segments were used for training the ensemble members, and each trained member then gave probability estimates for the held-out -th portion of the dataset. For one feature vector of MFCC features, each of the members gives a probability estimate, so a new feature space of features is derived from one input feature vector, which makes the procedure completely data-driven without any heuristic thresholding. These feature vectors have been used for training an SVM classifier with an RBF kernel in the final layer. We have used an SVM here instead of a majority voting-based decision because a majority voting system may give an estimate biased towards a particular ensemble member without considering the other members. For datasets even larger than the one used in this work, the number of SVM layers in the system may be increased, similar to an NN architecture.
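The two-stage training procedure can be sketched as below. To keep the example dependency-free, a nearest-centroid learner stands in for each RBF-kernel SVM; it is explicitly not an SVM and not libsvm, and its "probability" is only a squashed distance difference. The function names, the toy two-cluster data and the choice of 5 members are illustrative assumptions; only the data flow (shuffle, non-overlapping portions, member probabilities on a held-out portion as inputs to a final learner) follows the description above.

```python
import math
import random


def fit_centroid_model(X, y):
    """Stand-in learner (NOT an SVM): store the mean feature vector of each class."""
    model = []
    for label in (1, 0):
        rows = [x for x, t in zip(X, y) if t == label]
        if not rows:
            raise ValueError("each training portion needs both classes")
        model.append([sum(col) / len(rows) for col in zip(*rows)])
    return model  # [centroid of class 1, centroid of class 0]


def predict_proba(model, x):
    """Pseudo-probability of class 1 from squared distances to the two centroids."""
    d_pos = sum((a - b) ** 2 for a, b in zip(x, model[0]))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, model[1]))
    z = max(-50.0, min(50.0, d_neg - d_pos))  # clamp to keep exp() in range
    return 1.0 / (1.0 + math.exp(-z))


def train_stacked_ensemble(X, y, n_members, seed=0):
    """Train n_members learners on disjoint portions, then a final learner on
    their probability outputs for one extra held-out portion."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    parts = [order[k::n_members + 1] for k in range(n_members + 1)]
    members = [fit_centroid_model([X[i] for i in part], [y[i] for i in part])
               for part in parts[:n_members]]
    held_out = parts[-1]
    meta_X = [[predict_proba(m, X[i]) for m in members] for i in held_out]
    final = fit_centroid_model(meta_X, [y[i] for i in held_out])
    return members, final


def predict(members, final, x):
    meta = [predict_proba(m, x) for m in members]  # one feature per member
    return 1 if predict_proba(final, meta) >= 0.5 else 0


# Toy, well-separated 2-D data standing in for MFCC feature vectors.
rng = random.Random(1)
X = [[rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)] for _ in range(60)]
X += [[rng.gauss(-2.0, 0.5), rng.gauss(-2.0, 0.5)] for _ in range(60)]
y = [1] * 60 + [0] * 60

members, final = train_stacked_ensemble(X, y, n_members=5)
accuracy = sum(predict(members, final, x) == t for x, t in zip(X, y)) / len(X)
print(f"{len(members)} members, training accuracy {accuracy:.2f}")
```

Because each member sees only its own portion, the members can be trained independently, and the final learner sees one probability per member instead of the original feature vector.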
4 Result Analysis
The proposed ensemble SVM was tested on the test dataset described in the data description section. To demonstrate the efficacy of the classifier, we examine the ensemble architecture from different perspectives, such as the effect of the number of layer members and stability, and finally we compare the classifier with a stable, efficient neural network. To test the effect of ensembling, we gradually increase the number of ensemble members and report the resulting accuracy in Table 2.
From Table 2 we see that the classification accuracy increases with the number of ensemble members and almost saturates after 5 ensemble members (88.74% with 5 members versus 88.82% with 6). Although an individual SVM trained on a smaller dataset shows a poor average accuracy of 57.05%, the effect of ensembling is evident from the higher classification accuracies of the ensembles. To observe how the ensemble reduces estimation variance, we present the accuracy on two test files in Table 3. For example, for the ‘speech-librivox-0011’ file, the accuracy of a stand-alone SVM may vary from 8.6466% to 91.3534%, because the stand-alone SVMs are trained on different portions of the data and hence may perform differently on a particular test case. From Table 3 we can observe that the accuracy of an individual SVM may fluctuate, whereas the ensemble accuracy remains stable and close to the maximum individual accuracy.
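The spread of the individual members can be summarized directly from the Table 3 values. The short sketch below assumes that the first five member accuracies in Table 3 belong to ‘speech-librivox-0011’ and the last five to ‘noise-sound-bible-0031’; the dictionary layout and the helper name `summarize` are illustrative.

```python
from statistics import mean

# Accuracies in percent, copied from Table 3 (row grouping is an assumption).
table3 = {
    "speech-librivox-0011": {
        "members": [8.6466, 8.6466, 91.3534, 39.4737, 18.797],
        "ensemble": 90.2256,
    },
    "noise-sound-bible-0031": {
        "members": [5.26316, 30.8271, 21.4286, 49.6241, 68.797],
        "ensemble": 72.45,
    },
}


def summarize(entry):
    """Minimum, maximum and mean member accuracy next to the ensemble accuracy."""
    members = entry["members"]
    return {
        "min": min(members),
        "max": max(members),
        "mean": mean(members),
        "ensemble": entry["ensemble"],
    }


summaries = {name: summarize(entry) for name, entry in table3.items()}
for name, s in summaries.items():
    print(f"{name}: members {s['min']:.2f}-{s['max']:.2f} "
          f"(mean {s['mean']:.2f}), ensemble {s['ensemble']:.2f}")
```

For both files the ensemble accuracy lies far above the mean member accuracy, which is the stabilizing effect described in the text.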
Finally, we compare the performance of the ensemble SVM with the NN described in subsection 3.2. The NN and the ensemble SVM give 86.28% and 88.74% accuracy respectively. Their ROC curves are given in Fig. 3. The operating point of each classifier is marked with a circle on the curves, which shows that the ensemble SVM has a better true positive rate of and a slightly higher false positive rate of compared to the NN. The average areas under the curve (AUC) for the NN and the ensemble SVM are and respectively. From these performance indices, we can conclude that their performances are comparable.
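For readers unfamiliar with these metrics, the following sketch shows how an ROC curve and its AUC can be computed from classifier scores by sweeping the decision threshold and integrating with the trapezoidal rule. The labels and scores are made-up toy values, not results from the paper, and the function names are illustrative.

```python
def roc_curve(labels, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold step by step.

    Samples that share a score are processed together so ties yield one point.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
               for k in range(1, len(fpr)))


# Toy example: 1 = speech, 0 = non-speech, with hypothetical classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_curve(labels, scores)
area = auc(fpr, tpr)
print(f"AUC = {area:.4f}")
```

An operating point such as the circles in Fig. 3 corresponds to a single (false positive rate, true positive rate) pair on this curve, obtained at the classifier's chosen decision threshold.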
5 Conclusion
In this paper, we have proposed a novel ensemble SVM-based approach for voice activity detection. The efficacy of the proposed ensembling method has been established in the result section by comparing it with an NN and testing it on the test dataset. The member SVMs are independent of each other and can therefore operate in parallel, resulting in a significant reduction in runtime. In addition, different ensemble members can be trained on different types of features, giving a more robust VAD system. Replacing some of the layers of an NN with SVM layers may result in improved accuracy. In our future work, we will explore composite structures of NN and SVM with reduced training complexity.
References
- [1] R. Tucker, “Voice activity detection using a periodicity measure,” IEE Proceedings I (Communications, Speech and Vision), vol. 139, no. 4, pp. 377–380, 1992.
- [2] P. Renevey and A. Drygajlo, “Entropy based voice activity detection in very noisy conditions,” in Seventh European Conference on Speech Communication and Technology, 2001.
- [3] J.-H. Chang, N. S. Kim, and S. K. Mitra, “Voice activity detection based on multiple statistical models,” IEEE Transactions on Signal Processing, vol. 54, no. 6, pp. 1965–1976, 2006.
- [4] J. Ramírez, J. C. Segura, C. Benítez, A. De La Torre, and A. Rubio, “Efficient voice activity detection algorithms using long-term speech information,” Speech Communication, vol. 42, no. 3-4, pp. 271–287, 2004.
- [5] S. G. Tanyer and H. Ozer, “Voice activity detection in nonstationary noise,” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 478–482, 2000.
- [6] A. Davis, S. Nordholm, and R. Togneri, “Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 2, pp. 412–424, 2006.
- [7] J. W. Shin, J.-H. Chang, and N. S. Kim, “Voice activity detection based on statistical models and machine learning approaches,” Computer Speech & Language, vol. 24, no. 3, pp. 515–530, 2010.
- [8] J. Wu and X.-L. Zhang, “Maximum margin clustering based statistical VAD with multiple observation compound feature,” IEEE Signal Processing Letters, vol. 18, no. 5, pp. 283–286, 2011.
