Automatic Quality Control and Enhancement for Voice-Based Remote Parkinson's Disease Detection
Amir Hossein Poorjam, Mathew Shaji Kavalekalam, Liming Shi, Yordan P. Raykov, Jesper Rindom Jensen, Max A. Little, Mads Græsbøll Christensen

TL;DR
This paper develops automatic quality control methods for voice recordings to improve remote Parkinson's disease detection accuracy under various acoustic degradations like noise and reverberation.
Contribution
It introduces automatic quality assessment techniques to identify degradations and select suitable enhancement algorithms, enhancing PD detection in diverse acoustic conditions.
Findings
Quality control improves detection accuracy
Effective enhancement algorithm selection based on degradation type
Demonstrated robustness in real-world noisy environments
Preprint
Automatic Quality Control and Enhancement for Voice-Based Remote Parkinson’s Disease Detection
Amir Hossein Poorjam, Student Member, IEEE, Mathew Shaji Kavalekalam, Student Member, IEEE, Liming Shi, Student Member, IEEE, Yordan P. Raykov, Jesper Rindom Jensen, Member, IEEE, Max A. Little, Member, IEEE, and Mads Græsbøll Christensen, Senior Member, IEEE

This work was funded by Independent Research Fund Denmark: DFF 4184-00056.
A.H. Poorjam, M.S. Kavalekalam, L. Shi, J.R. Jensen and M.G. Christensen are with the Audio Analysis Lab, CREATE, Aalborg University, Aalborg 9000, Denmark (e-mail: {ahp,msk,ls,jrj,mgc}@create.aau.dk).
M.A. Little is with the School of Engineering and Applied Science, Aston University, Birmingham, UK, and also with Media Lab, MIT, Cambridge, Massachusetts, USA (e-mail: [email protected]).
Y.P. Raykov is with the School of Engineering and Applied Science, Aston University, Birmingham, UK (e-mail: [email protected]).
Abstract
The performance of voice-based Parkinson’s disease (PD) detection systems degrades when there is an acoustic mismatch between training and operating conditions caused mainly by degradation in test signals. In this paper, we address this mismatch by considering three types of degradation commonly encountered in remote voice analysis, namely background noise, reverberation and nonlinear distortion, and investigate how these degradations influence the performance of a PD detection system. Given that the specific degradation is known, we explore the effectiveness of a variety of enhancement algorithms in compensating this mismatch and improving the PD detection accuracy. Then, we propose two approaches to automatically control the quality of recordings by identifying the presence and type of short-term and long-term degradations and protocol violations in voice signals. Finally, we experiment with using the proposed quality control methods to inform the choice of enhancement algorithm. Experimental results using the voice recordings of the mPower mobile PD data set under different degradation conditions show the effectiveness of the quality control approaches in selecting an appropriate enhancement method and, consequently, in improving the PD detection accuracy. This study is a step towards the development of a remote PD detection system capable of operating in unseen acoustic environments.
Index Terms:
Acoustic Mismatch, Dereverberation, Parkinson’s Disease Detection, Speech Enhancement, Quality Control
I Introduction
Parkinson’s disease (PD) is a neurodegenerative disorder which progressively makes the patients unable to control their movement normally and, consequently, decreases the patients’ quality of life [1]. Since there is no cure for PD, it is necessary to develop tools to diagnose this disease in early stages in order to control its symptoms. Speech is known to reflect the PD symptoms since the majority of PD patients suffer from some forms of vocal disorder [2]. It has been demonstrated in [3] that early changes of clinical symptoms of PD are more reflected and pronounced in acoustic analysis of voice signals than in perceptual evaluation of voice by a therapist. This has motivated researchers to take advantage of advanced speech signal processing and machine learning algorithms to develop highly accurate and data-driven methods for detecting PD symptoms from voice signals [4, 5, 6]. Moreover, advances in smart phone technology provide new opportunities for remote monitoring of PD symptoms by bypassing the logistical and practical limitations of recording voice samples in controlled experimental conditions in clinics [7, 5]. However, there is a higher risk outside controlled lab conditions that participants may not adhere to the test protocols, which probe for specific symptoms, due to lack of training, misinterpretation of the test protocol or negligence. Moreover, voice signals in remote voice analysis might be subject to a variety of degradations during recording or transmission. Processing the degraded recordings or those which do not comply with the assumptions of the test protocol can produce misleading, non-replicable and non-reproducible results [8] that could have significant ramifications for the patients’ health. In addition, degradation of voice signals produces an acoustic mismatch between the training and operating conditions in automatic PD detection. 
A variety of techniques have been developed for compensating this type of mismatch in different speech-based applications [9, 10, 11, 12, 13, 14, 15] which can, in general, be categorized into four classes: (1) searching for robust features which parameterize speech regardless of degradations; (2) transforming a degraded signal to the acoustic condition of the training data using a signal enhancement algorithm (in this paper, “signal enhancement” refers to all algorithms intended to enhance the quality of degraded signals); (3) compensating the effects of degradation in the feature space by applying feature enhancement; and (4) transforming the parameters of the developed model to match the acoustic conditions of the degraded signal at operating time. However, to the best of the authors’ knowledge, there is a lack of studies of the impact of acoustic mismatch and the effect of compensation on the performance of PD detection systems. Vasquez-Correa et al. proposed a pre-processing scheme that applies a generalized subspace speech enhancement technique to the voiced and unvoiced segments of a speech signal to address PD detection in non-controlled noise conditions [16]. They showed that applying speech enhancement to the unvoiced segments leads to an improvement in detection accuracy while the enhancement of voiced segments degrades the performance. However, this study is limited in terms of degradation types as it only considered additive noise. Moreover, they only evaluated the impact of an unsupervised enhancement method on PD detection performance, while supervised algorithms have, in general, been shown to reconstruct higher quality signals as they incorporate more prior information about the speech and noise.
Another open question which, to the authors’ knowledge, has not been addressed is whether applying “appropriate” signal enhancement algorithms to the degraded signals will result in an improvement in PD detection performance. Answering this question, however, requires prior knowledge about the presence and type of degradation in voice signals, which can be achieved by controlling the quality of recordings prior to analysis. Quality control of the voice recordings is typically performed manually by human experts which is a very costly and time consuming task, and is often infeasible in online applications. In [17], the problem of quality control in remote speech data collection has been approached by identifying the potential outliers which are inconsistent, in terms of the quality and the context, with the majority of speech samples in a data set. Even though very effective in finding outliers, it is not capable of detecting the type of degradation nor identifying short-term protocol violations in recordings. To identify the type of degradation in pathological voices, Poorjam et al. proposed two different parametric and non-parametric approaches to classify degradations commonly encountered in remote pathological voice analysis into four major types, namely background noise, reverberation, clipping and coding [18, 19]. However, the performance of these approaches is limited when new degradation types are introduced. Furthermore, the presence of outlier recordings, which do not contain relevant information for PD detection due to long-term protocol violations, is not considered in these methods and, therefore, there is no control over the class assignment for such recordings. To address the frame-level quality control in pathological voices, Badawy et al. proposed a general framework for detecting short-term protocol violations using a nonparametric switching autoregressive model [20]. 
In [21], a highly accurate approach for identifying short-term protocol violations in PD voice recordings has been proposed which fits an infinite hidden Markov model to the frames of the voice signals in the mel-frequency cepstral domain. However, these two approaches do not identify short-term degradations (e.g. the presence of an instantaneous background noise) in voice signals.
To overcome the explained limitations in the existing methods, we propose two approaches for controlling the quality of pathological voices at recording-level and frame-level in this paper. In the recording-level approach, separate statistical models are fitted to the clean voice signals and the signals corrupted by different degradation types. The likelihood of a new observation given each of the models is then used to determine its degree of adherence to each class of acoustic conditions. This gives us the flexibility not only to associate multiple classes to a voice signal corrupted by a combination of different degradations, but also to consider a recording as an outlier or a new degradation when it is rejected by all the models. In the frame-level approach, on the other hand, we extend the work in [21] to identify short-term protocol violations and degradations in voice signals at the same time. We show how the proposed quality control approaches can effectively inform the choice of signal enhancement methods and, consequently, improve the PD detection performance. The contribution of this paper is thus three-fold: (1) we investigate the impact of acoustic mismatch between training and operating conditions, due to degradation in test signals, on the PD detection performance; (2) to identify this mismatch, we propose two different approaches to automatically control the quality of pathological voices at frame- and recording-level; and (3) to efficiently reduce this mismatch, given that the specific degradation is known, we explore a variety of state-of-the-art enhancement algorithms and their effectiveness in improving the performance of a PD detection system. The rest of the paper is organized as follows. Section II explains the PD detection system that we have used for the experiments throughout this paper. 
In Section III, we investigate the impact of three major types of signal degradation commonly encountered in remote voice analysis, namely noise, reverberation and nonlinear distortion, on the performance of the PD detection system. Following that, in Section IV, we investigate the influence of noise reduction and dereverberation algorithms on the performance of the PD detection system. In Section V, we propose two different quality control approaches and investigate how these methods can improve the performance of PD detection. Finally, Section VI summarizes the paper.
II Parkinson’s Disease Detection System
In this section, we describe the PD detection system we will use for further quality control and enhancement experiments. This approach, which was proposed in [22], fits Gaussian mixture models (GMMs) to the frames of the voice recordings of the PD patients and the healthy controls (HC) parametrized by perceptual linear predictive (PLP) coefficients [23]. The motivation for using PLP parametrization is that the perceptual features are more discriminative in PD detection than the conventional and clinically interpretable ones (such as standard deviation of fundamental frequency, jitter, shimmer, harmonic-to-noise ratio, glottal-to-noise excitation ratio, articulation rate, and frequencies of formants), particularly when the voice is more noisy, aperiodic, irregular and chaotic, which typically happens in more advanced stages of PD [24, 25, 26].
Acoustic features of the PD patients’ recordings and those of the healthy controls are modeled by GMMs with the likelihood function defined as:
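$$p(\mathbf{x}_t \mid \lambda) = \sum_{c=1}^{C} b_c \, p(\mathbf{x}_t \mid \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c), \qquad (1)$$

where $\mathbf{x}_t$ is the feature vector at time frame $t$, $b_c$ is the mixture weight of the $c$-th mixture component, $C$ is the number of Gaussian mixtures, and $p(\mathbf{x}_t \mid \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$ is a Gaussian probability density function where $\boldsymbol{\mu}_c$ and $\boldsymbol{\Sigma}_c$ are the mean and covariance of the $c$-th mixture component, respectively. The parameters of the model, $\lambda = \{b_c, \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c\}_{c=1}^{C}$, are trained through the expectation-maximization algorithm [27].
Given $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_T)$, a sequence of feature vectors, the goal in PD detection is to find the model which maximizes $p(\lambda_j \mid \mathbf{X})$, where $j \in \{\mathrm{PD}, \mathrm{HC}\}$. Using Bayes’ rule, the independence assumption between frames, and assuming equal priors for the classes, the PD detection system computes the log-likelihood ratio for an observation as:

$$\mathrm{LLR}(\mathbf{X}) = \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda_{\mathrm{PD}}) - \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda_{\mathrm{HC}}). \qquad (2)$$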
The final decision about the class assignment for an observation is made by setting a threshold over the obtained score.
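The scoring step described above can be illustrated with a minimal sketch. It assumes diagonal covariance matrices (the text does not state the covariance structure), and the two toy one-dimensional models carry hypothetical values rather than trained parameters; the function names are illustrative only.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_frame_loglik(x, weights, means, variances):
    """log p(x | model), using log-sum-exp over the mixture components."""
    terms = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def llr_score(frames, gmm_pd, gmm_hc):
    """Sum over frames of log p(x | PD model) - log p(x | HC model)."""
    return sum(
        gmm_frame_loglik(x, *gmm_pd) - gmm_frame_loglik(x, *gmm_hc)
        for x in frames
    )

# Toy models as (weights, means, variances); values are hypothetical.
gmm_pd = ([0.5, 0.5], [[1.0], [3.0]], [[1.0], [1.0]])
gmm_hc = ([0.5, 0.5], [[-1.0], [-3.0]], [[1.0], [1.0]])
```

A positive score favours the PD model and a negative score favours the HC model; the decision threshold applied to the score is a separate design choice.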
II-1 Experimental Setup
In this study, we use the sustained vowel /a/ as the speech material for PD detection since it provides a simpler acoustic structure to characterize the glottal source and resonant structure of the vocal tract than running speech. Moreover, perceptual analysis of different vowels suggests that the best PD detection performance can be achieved when the sustained vowel phonation /a/ is parametrized by the PLP features [24]. We consider the mPower mobile Parkinson’s disease (MMPD) data set [28] which consists of more than 65,000 samples of 10 second sustained vowel /a/ phonations recorded via smartphones by PD patients and healthy speakers of both genders from the US. The designed voice test protocol for this data set required the participants to hold the phone in a similar position to making a phone call, take a deep breath and utter a sustained vowel /a/ at a comfortable pitch and intensity for 10 seconds. A subset of 800 good-quality voice samples (400 PD patients and 400 healthy controls equally from both genders) has been selected from this data set. It should be noted that the health status in this data set is self-reported. To have more reliable samples, among participants who self-reported to have PD, we selected those who claimed that they have been diagnosed by a medical professional with PD and recorded their voice right before taking PD medications. For the healthy controls, we selected participants who self-reported being healthy, do not take PD medications, and claimed that they have not been diagnosed by a medical professional with PD. All speakers of this subset were aged between 58 and 72. The mean ± standard deviation (STD) of the age of PD patients and healthy controls are 64 ± 4 and 66 ± 4, respectively. For all experiments in this paper, we downsampled the recordings from 44.1 kHz to 8 kHz since the enhancement algorithms used in this work operate at 8 kHz.
To extract the PLP features, voice signals are first segmented into frames of 30 ms with 10 ms overlap using a Hamming window. Then, 13 PLP coefficients are computed for each frame of a signal. To capture the dynamic changes between frames due to the deviations in articulation, first- and second-order orthogonal polynomials are fitted to the two feature vectors to the left and right of the current frame. These features, which are referred to as delta and double-delta, are appended to the feature vector to form a 39-dimensional vector per frame. The number of mixture components for the GMMs was set to 32.
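As a rough illustration of the segmentation step only (not the PLP analysis itself), the sketch below frames a signal sampled at 8 kHz into 30 ms Hamming-windowed frames. It takes the stated 10 ms overlap literally, which gives a 20 ms hop; the function names and the decision to drop the trailing partial frame are assumptions.

```python
import math

def hamming(n):
    """Symmetric Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(signal, fs=8000, frame_ms=30, overlap_ms=10):
    """Split a signal into windowed frames; a trailing partial frame is dropped."""
    frame_len = fs * frame_ms // 1000            # 240 samples at 8 kHz
    hop = frame_len - fs * overlap_ms // 1000    # 160 samples at 8 kHz
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

# One second of a constant signal, used only to show the frame geometry.
frames = frame_signal([1.0] * 8000)
```

Each frame would then be mapped to 13 PLP coefficients, and appending the delta and double-delta terms gives the 3 × 13 = 39 dimensions mentioned above.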
II-2 Results
To evaluate the performance of the PD detection system in a matched acoustic condition, we used 5-fold cross validation (CV) in which the recordings were randomly divided into 5 non-overlapping and equal sized subsets. The entire CV procedure was repeated 10 times to obtain the distribution of detection performance. Fig. 1 shows the performance in terms of the receiver operating characteristic (ROC) curve, along with the 95% confidence interval. In an ROC curve, the true positive rate is plotted against the false positive rate for different decision thresholds. The area under the curve (AUC) summarizes the ROC curve and represents the performance of a detection system by a single number between 0 and 1; the higher the performance, the closer the AUC value is to 1. Compared with the commonly used classification accuracy, the AUC is the preferred metric in this paper since it summarizes the class overlap, which sets a fundamental limit on the classification accuracy. The mean AUC for this PD detection system is 0.95.
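To make the AUC concrete, the sketch below computes it directly from detection scores, using the fact that the area under the ROC curve equals the probability that a randomly chosen positive (PD) score exceeds a randomly chosen negative (HC) score, with ties counted as one half. The function name and the example scores are illustrative only.

```python
def auc_from_scores(pd_scores, hc_scores):
    """AUC as the fraction of (PD, HC) score pairs ranked correctly."""
    wins = 0.0
    for p in pd_scores:
        for h in hc_scores:
            if p > h:
                wins += 1.0
            elif p == h:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pd_scores) * len(hc_scores))

# Made-up scores: only one of the four pairs (3 vs 2) is ranked correctly.
example_auc = auc_from_scores([1.0, 3.0], [2.0, 4.0])
```

Because the value depends only on how the two score distributions overlap, it does not require choosing a decision threshold.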
III Impact of Signal Degradation on PD Detection
The PD detection system explained in the previous section gave a mean AUC of 0.95 in a matched acoustic condition, that is, when it was trained and tested using clean recordings. However, as alluded to in the introduction, recordings collected remotely in an unsupervised manner are seldom clean as they are often corrupted by different types of degradation. In this section, we investigate the effect of three commonly encountered degradations, namely reverberation, background noise and nonlinear distortion, on the performance of the PD detection system. It should be noted that even though we tried to choose the most reliable samples from the MMPD data set, the labels might still not be 100% reliable as the diagnosis is self-reported. For this reason, we are more interested in how the relative PD detection performance changes systematically across the different experimental conditions.
III-A Reverberation
Reverberation is a phenomenon that occurs when the signal of interest is captured in an acoustically enclosed space. Apart from the direct component, the microphone receives multiple delayed and attenuated versions of the signal, which is characterized by the room impulse response (RIR). A metric commonly used to measure the reverberation is the reverberation time (RT60) [29]. The presence of reverberation has been shown to degrade the performance of speech-based applications such as speech and speaker recognition [30, 31]. In this section, we investigate the effect of reverberation on the PD detection performance. To this aim, we used 5-fold CV repeated 10 times to evaluate the performance. In each iteration, the model was trained using the clean recordings of the training subset, and evaluated on the recordings of the disjoint test subset which were filtered with synthetic room impulse responses of RT60 varying from 300 ms to 1.8 s in 300 ms steps measured at a fixed position in a room of dimensions 10 m × 6 m × 4 m. The distance between source and microphone is set to 2 m. The room impulse responses were generated using the image method [32] and implemented using the RIR Generator toolbox [33]. Fig. 2(a) shows the impact of reverberation on the PD detection performance in terms of the mean AUC along with 95% confidence intervals. We can observe from the plot that the PD detection system exhibits lower performance in reverberant environments, as expected, and the amount of degradation is related to the RT60.
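The filtering operation itself can be sketched as follows. Note that this is not the image method used in the experiments: as a simple stand-in, the RIR here is exponentially decaying Gaussian noise whose envelope falls by 60 dB after RT60, and all names and parameter values are assumptions chosen for illustration.

```python
import math
import random

def synthetic_rir(rt60_s, fs=8000, seed=0):
    """Toy RIR: unit direct path followed by noise decaying 60 dB over rt60_s."""
    rng = random.Random(seed)
    n = round(fs * rt60_s)
    # Amplitude must reach 10**-3 (i.e. -60 dB) after rt60_s * fs samples.
    decay = 3.0 * math.log(10.0) / (rt60_s * fs)
    rir = [rng.gauss(0.0, 1.0) * math.exp(-decay * i) for i in range(n)]
    rir[0] = 1.0  # direct component
    return rir

def convolve(signal, rir):
    """Direct-form linear convolution (fine for short toy signals)."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out

# A low sampling rate keeps the toy example fast; real use would keep fs=8000.
rir = synthetic_rir(0.5, fs=100)
reverberant = convolve([1.0, 0.0, 0.0, 0.5], rir)
```

For full-length recordings an FFT-based convolution would be the practical choice; the nested loop is kept here only for readability.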
III-B Background Noise
Background noise is one of the most common types of degradation occurring during remote voice analysis. In this section, we restrict ourselves to additive background noise and investigate how this can influence the PD detection performance. To this aim, we performed the same CV procedure used for evaluating the impact of reverberation (explained in the previous section). In each iteration, the model was trained using the clean recordings of the training subset, and evaluated on the recordings of the test subset contaminated by an additive noise. The entire procedure was repeated for four different noise types, namely babble, restaurant, office and street noise, and different signal-to-noise ratios (SNRs) ranging from -5 dB to 10 dB in 5 dB steps. (The babble, restaurant and street noise files have been taken from https://www.soundjay.com/index.html and the office noise has been taken from https://freesound.org/people/DavidFrbr/sounds/327497.) Fig. 2(b) illustrates the impact of different noise types and different SNR conditions on the performance of the PD detection system in terms of the mean AUC along with the 95% confidence intervals. We can observe a similar trend for all noise types: the PD detection performance decreases as the noise level increases.
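Contaminating a clean recording at a prescribed SNR amounts to scaling the noise before adding it. The sketch below assumes the SNR is defined over the whole recording as the ratio of average signal power to average noise power; the function names and the toy signals are illustrative.

```python
import math

def power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def add_noise_at_snr(clean, noise, snr_db):
    """Scale noise so that 10*log10(P_clean / P_noise) equals snr_db, then add it."""
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    gain = math.sqrt(power(clean) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [c + gain * n for c, n in zip(clean, noise)]

# Toy signals; a real experiment would use a recording and a noise file.
clean = [1.0, -1.0] * 50
noise = [0.5, 0.5, -0.5, -0.5] * 25
noisy = add_noise_at_snr(clean, noise, snr_db=5.0)
```

Subtracting `clean` from `noisy` recovers the scaled noise, whose power relative to the clean signal gives back the requested 5 dB.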
III-C Clipping
In remote voice analysis, nonlinear distortion can manifest itself in speech signals in many different ways such as clipping, compression, packet loss and combinations thereof. Here, we consider clipping as an example of nonlinear distortion in signals which is caused when a signal fed as an input to a recording device exceeds the dynamic range of the device [34]. By defining the clipping level as a proportion of the unclipped peak absolute signal amplitude to which samples greater than this threshold are limited, we can investigate the impact of clipping on the PD detection performance. To this aim, the clean recordings of the test subset in each iteration of the CV were clipped with different clipping levels ranging from 0.1 to 0.8 in 0.1 steps. Fig. 2(c) shows the performance as a function of clipping level. Similar to the other types of degradation, it can be observed that increasing the distortion level in voice signals decreases the PD detection performance.
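Under the definition of clipping level given above, the operation reduces to hard-limiting every sample to a fraction of the peak absolute amplitude. A minimal sketch, with an illustrative function name and toy input:

```python
def clip_signal(signal, level):
    """Hard-limit samples to +/- level * peak, where peak = max |sample|."""
    if not 0.0 < level <= 1.0:
        raise ValueError("level must lie in (0, 1]")
    threshold = level * max(abs(s) for s in signal)
    return [max(-threshold, min(threshold, s)) for s in signal]

# With a peak of 1.0 and level 0.4, samples beyond +/-0.4 are limited.
clipped = clip_signal([0.0, 0.5, -1.0, 0.2], 0.4)
```

A smaller level therefore means more severe distortion, and a level of 1 leaves the signal unchanged.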
IV Impact of Noise Reduction and Dereverberation on PD Detection
As seen in Section III, the degradation introduced to the signals can lead to a reduction in the performance of the PD detection system. Since there are practically an infinite number of possible types and combinations of nonlinear distortion that can be present in a signal, and since there is a lack of well-documented algorithms for dealing with most of the distortions (even in isolation), in this section, we only consider the degradations for which there are well-documented and verified enhancement algorithms, such as noise reduction and dereverberation, and investigate the effects of these algorithms on the PD detection performance. To this end, from the 50 PD detection models developed and evaluated through 10 iterations of the 5-fold cross-validation procedure, as explained in Section II-2, we selected one of the two models which showed the median performance and used it for further enhancement experiments in this section. We have used a total of 160 recordings for testing the algorithms used in this section. We will restrict ourselves to single channel enhancement algorithms. It should be noted that there exist a variety of objective and subjective metrics to measure the quality of the enhanced speech signal such as SNR, signal-to-distortion ratio [35], perceptual evaluation of speech quality [36] and short-time objective intelligibility [37]. However, since our main goal in this work is to study the influence of speech enhancement on the PD detection performance, we evaluate the effectiveness of the algorithms in terms of the AUC.
IV-A Dereverberation
Some of the popular classes of dereverberation techniques are the spectral enhancement methods [38], probabilistic model based methods [39, 40] and inverse filtering based methods [41, 42]. Spectral enhancement methods estimate the clean speech spectrogram by frequency domain filtering using the estimated late reverberation statistics. The probabilistic model based methods model the reverberation using an autoregressive (AR) process, and the clean speech spectral coefficients using a certain probability distribution function. The estimated parameters of the model are then used to perform dereverberation. Lastly, the inverse filtering methods use a blindly estimated room impulse response to design an equalization system. These methods, which are mainly developed for the running speech, assume that the signal at a particular time-frequency bin is uncorrelated with the signals at that same frequency bin for frames beyond a certain number [40]. However, this assumption is not valid for the sustained vowels which makes the dereverberation of the sustained vowels more challenging. Recently, deep neural network (DNN) based dereverberation algorithms have gained attention [43, 44] since they relax the assumption of uncorrelated neighboring time-frequency bins. The underlying principle of the DNN-based methods is to train a DNN to map the log-magnitude spectrum of the degraded speech to that of the desired speech.
In this section, we investigate the effectiveness of different dereverberation algorithms in improving the PD detection performance. For dereverberation experiments, we used three different algorithms: a probabilistic model based algorithm proposed in [40] (denoted as WPE-CGG, weighted prediction error with complex generalized Gaussian prior), an algorithm based on the inverse filtering of the modulation transfer function [41] (denoted as IF-MU, inverse filtering with multiplicative update), and a DNN-based speech enhancement algorithm proposed in [43] (denoted as DNN-SE). It should be noted that the WPE-CGG and the IF-MU are unsupervised methods whereas DNN-SE is a supervised method. For the DNN-based algorithm, a feedforward neural network with 3 hidden layers of 1,600 neurons was used. To take into account the temporal dynamics, features of 11 consecutive frames (including the current frame, 5 frames to the left and 5 frames to the right over time) were provided to represent the input features of the current frame. To train the DNN model, we selected 640 clean recordings from the MMPD data set and filtered them with synthetic room impulse responses of RT60 ranging from 200 ms to 1 s in steps of 100 ms using the implementation in [33] for a particular source and receiver position in a room of dimensions 10 m × 6 m × 4 m. For testing, the position of the receiver was fixed while the position of the source was varied randomly from 60 degrees left of the receiver to 60 degrees right of the receiver. Fig. 3 shows the performance of the PD detection in terms of AUC for the different dereverberation algorithms. It can be observed from the figure that only DNN-SE is able to improve the PD detection performance while the other two methods degrade the performance.
This is mainly due to two reasons: first, the DNN-SE is a supervised algorithm while the WPE-CGG and IF-MU are unsupervised; and second, the underlying assumption of the two unsupervised algorithms does not hold for the sustained vowels. We have also included the case of zero RT60 to investigate the impact of processing of the clean recordings by these dereverberation algorithms.
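The 11-frame input context described for the DNN-SE above can be sketched as a simple splicing operation. How the first and last frames of a recording are handled is not stated, so repeating the edge frame is an assumption here, as are the function name and the toy features.

```python
def splice_context(frames, left=5, right=5):
    """Concatenate each frame with its neighbours; edge frames are repeated."""
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for k in range(t - left, t + right + 1):
            # Clamp the index so that out-of-range neighbours reuse the edge frame.
            window.extend(frames[min(max(k, 0), last)])
        spliced.append(window)
    return spliced

# Toy one-dimensional "features" so that each spliced vector is easy to read.
toy_frames = [[float(i)] for i in range(20)]
spliced = splice_context(toy_frames)
```

With D-dimensional per-frame features, each network input therefore has 11 × D dimensions.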
IV-B Noise reduction
Methods for performing noise reduction can be broadly categorized into supervised and unsupervised methods. Unsupervised methods do not assume any prior knowledge about identity of the speaker or noise environment. The supervised methods, on the other hand, make use of training data to train the models representing the signals of interest or the noise environment. Some of the popular classes of supervised speech enhancement methods include the codebook-based methods [45, 46], non-negative matrix factorization based methods [47, 10] and the DNN-based methods [48]. In the supervised method, the speech and noise statistics/parameters estimated using the training data are exploited within a filter to remove the noise components from the noisy observation. In this section, we used two supervised methods and one unsupervised method to investigate the effect of different noise reduction algorithms in reducing the acoustic mismatch between training and operating conditions.
The first supervised enhancement algorithm is based on the framework proposed in [49]. In this approach, a Kalman filter, which takes into account the voiced and unvoiced parts of speech [50], is used for enhancement. The filter parameters consist of the AR coefficients and excitation variance corresponding to speech and noise along with the pitch parameters (i.e. the fundamental frequency and the degree of voicing). Based on [49], the AR coefficients and excitation variance of the speech and noise are estimated using a codebook-based approach, and the pitch parameters are estimated from the noisy signal using a harmonic model based approach [51]. We refer to this method in the rest of this paper as the Kalman-CB. This algorithm has been selected because of its good performance in noise reduction in terms of quality and intelligibility based on both objective and subjective measures. The speech codebook was trained using 640 clean recordings selected from the MMPD data set (equally from both genders). To train the noise codebook, we used babble, restaurant, office and street noises to create four sub-codebooks. During the testing phase, all sub-codebooks, except the one corresponding to the target noise, were concatenated to form the final noise codebook. The sizes of the speech and noise codebooks were set to 8 and 12, respectively.
The second supervised enhancement method is the DNN-based algorithm proposed in [43]. This algorithm is the same as the one we used for dereverberation experiments, except that it is trained using noisy signals. This algorithm has been selected because, besides improvements in objective measures, it showed improvement in the performance of automatic speech recognition in noisy environments. To train the DNN, we used the same 640 clean recordings that we used for training the speech codebook in the Kalman-CB algorithm. The recordings were contaminated by three types of noise, namely babble, factory and F16 noises taken from the NOISEX-92 database [52], under different SNR conditions selected randomly from the continuous interval [0,10] dB.
We used, as an unsupervised speech enhancement method, the algorithm proposed in [53] which is based on the minimum mean-square error (MMSE) estimation of discrete Fourier transform (DFT) coefficients of speech while assuming a generalized gamma prior for the speech DFT coefficients. This method, denoted as MMSE-GGP, is a popular unsupervised algorithm which uses the MMSE-based tracker for noise power spectral density estimation.
Fig. 4 shows the PD detection performance in terms of AUC for different noise types and SNR conditions. It can be observed from the figures that enhancing the degraded voice signals with the supervised methods in general improves the performance, whereas the unsupervised method shows improvement only in the low SNR range and degrades the PD detection performance in higher SNR scenarios. The low performance of the unsupervised algorithm may be due to the fact that the noise statistics in this case are estimated using a method proposed in [54] which has been designed for running speech rather than sustained vowels. This observation is somewhat consistent with the statement in [16], which suggested that applying an unsupervised enhancement algorithm to the voiced segments results in a degradation in PD detection performance.
IV-C Joint Noise Reduction and Dereverberation
In Sections IV-A and IV-B, we showed the impact of noise reduction and dereverberation when one of these degradations was present in the signal. However, in some cases, the recordings may be degraded simultaneously by reverberation and background noise. There have been methods proposed for joint noise reduction and dereverberation with access to multiple channels [55, 56]. Since we have restricted ourselves to single channel enhancement methods, and motivated by the improvement in the PD detection performance as a result of using the DNN-SE algorithm for noise reduction and dereverberation, in this section, we investigate the effectiveness of this algorithm in performing joint noise reduction and dereverberation. In this case, the input to the DNN is the log-magnitude spectrum of the signal which is degraded by reverberation and background noise. For training the DNN model, the same 640 clean recordings that we used in the previous enhancement experiments were filtered with RIRs of different RT60s ranging from 400 ms to 1 s with 200 ms steps. Then, three types of noise, namely babble, factory and F16 noises (taken from NOISEX-92 database) were randomly added to the reverberant signals at different SNRs selected uniformly at random from the continuous interval [0,10] dB. Table I summarizes the impact of joint noise reduction and dereverberation using the DNN-SE algorithm on the PD detection performance. In this table, we have also included the cases of infinite SNR and zero RT60 to investigate the effect of the enhancement system when the clean recordings or the ones degraded by only noise or reverberation were processed by this algorithm. It can be observed for the case of babble noise that the DNN-SE improves the PD detection performance in most of the cases when reverberation and background noise coexist and in the cases where only noise is present. 
However, in the case of only reverberation, the DNN-SE shows improvement only in the cases where RT60 is 400 ms and above. It should be noted that the babble noise signals used for training and testing were taken from two different noise databases. In the case of restaurant noise, improvement in PD detection performance is observed only at the low SNRs, namely -2 dB and -6 dB. The results for the restaurant noise are interesting in the sense that they show how the DNN-SE algorithm generalizes to a noise type not seen during the training phase.
V Automatic Quality Control in Pathological Voice Recordings
We have shown in the previous section that, assuming the specific degradation is known, there exist algorithms to effectively transform a voice signal from a degraded condition into the acoustic condition in which models are trained. Choosing the appropriate enhancement algorithm, however, requires prior knowledge about the presence and type of degradation in a voice signal. In this section, we introduce two approaches to automatically control the quality of recordings. The first approach detects, at recording level, the presence and type of degradation which has influenced the majority of frames of the signal. The second approach, on the other hand, detects short-term degradations and protocol violations in a signal.
V-A Recording-Level Quality Control
The major limitation of the classification-based approaches for identifying the type of degradation in a voice signal [18, 19] is that they do not consider the fact that a recording can be subject to an infinite number of possible combinations of degradations in real scenarios. This causes problems when a signal is contaminated by a new type of degradation for which the classifier has not been trained. Moreover, there is no control over class assignment for high-quality outliers which do not comply with the context of the data set.
To overcome these limitations, instead of using a multiclass classifier, we propose to use a set of parallel likelihood ratio detectors for the major types of degradations commonly encountered in remote voice analysis, each detecting a certain degradation type. This way, the likelihood ratio statistics of an observation given each of the models can be translated to the degree of contribution of each degradation to the degraded observation. Moreover, completely new degradation types and high-quality outliers can be detected if all models reject those observations according to a pre-defined threshold.
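In this approach, the task of each detector is to determine whether a feature vector of the time frame $t$ of a voice signal, $\mathbf{x}_t$, was contaminated by the corresponding degradation, $H_0$, or not, $H_1$. The decision about the adherence of each frame of a given speech signal to the hypothesized degradation is then computed as:
$$\frac{p(\mathbf{x}_t \mid H_0)}{p(\mathbf{x}_t \mid H_1)} \begin{cases} \geq \tau, & \text{accept } H_0 \\ < \tau, & \text{reject } H_0 \end{cases}$$
where $\tau$ is a pre-defined threshold for detection, and $p(\mathbf{x}_t \mid H_0)$ and $p(\mathbf{x}_t \mid H_1)$ are respectively the likelihood of the hypotheses $H_0$ and $H_1$ given $\mathbf{x}_t$.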
To model the characteristics of each hypothesized degradation, we propose to fit a GMM of the likelihood function defined in (1) to the frames of the recordings in the feature space. The motivation for using GMMs is that they are computationally efficient models that are capable of modeling sufficiently complex densities as a linear combination of simple Gaussians. Thus, the underlying acoustic classes of the signals might be modeled by individual Gaussian components. While the hypothesized degradation models can be well characterized by using training voice signals contaminated by the corresponding degradation, it is very challenging to model the alternative hypothesis as it should represent the entire space of all possible negative examples expected during recognition. To model the alternative hypothesis, instead of using individual degradation-specific alternative models, we train a single degradation-independent GMM using a large number of clean, degraded and outlier voice signals. Since this background model is used as an alternative hypothesis model for all hypothesized degradations, it is referred to as a universal background model (UBM).
Once the UBM is trained, a set of degradation-dependent GMMs for modeling clean, noisy, reverberant and distorted recordings, $\lambda_{\mathrm{deg}}$, is derived by adapting the parameters of the UBM through maximum a posteriori estimation using the corresponding training data. Given the UBM, $\lambda_{\mathrm{ubm}}$, and the trained degradation model, $\lambda_{\mathrm{deg}}$, and assuming that the feature vectors are independent, the log-likelihood ratio for a test observation, $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_T)$, is calculated as:
$$\sigma(\mathbf{X}) = \frac{1}{T}\left[\sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda_{\mathrm{deg}}) - \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda_{\mathrm{ubm}})\right] \qquad (4)$$
The scaling factor $1/T$ in (4) is used to make the log-likelihood ratio independent of the signal duration and to compensate for the strong independence assumption for the feature vectors [57]. The decision for the test observation can be made by setting a threshold over the scores.
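As an illustration of the parallel detectors, the following minimal sketch scores a recording against each degradation model relative to the UBM and keeps the detectors whose duration-normalized log-likelihood ratio reaches a threshold. The diagonal-covariance GMMs given as `(weight, mean, variance)` tuples, the two-dimensional toy features, the model parameters and the threshold of 0 are all assumptions made for the example; the models in the experiments are MAP-adapted from a UBM trained on MFCC features.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gmm_loglik(x, gmm):
    """log p(x | gmm) for a GMM given as a list of (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))  # log-sum-exp

def avg_llr(frames, gmm_deg, gmm_ubm):
    """Log-likelihood ratio of a recording, divided by its number of frames."""
    total = sum(gmm_loglik(x, gmm_deg) - gmm_loglik(x, gmm_ubm) for x in frames)
    return total / len(frames)

def detect(frames, detectors, gmm_ubm, threshold=0.0):
    """Return the detectors that accept the recording; empty if all models reject it."""
    scores = {name: avg_llr(frames, g, gmm_ubm) for name, g in detectors.items()}
    return {name: s for name, s in scores.items() if s >= threshold}

# Toy models (hypothetical parameters, 2-D features for readability).
ubm = [(0.5, [0.0, 0.0], [4.0, 4.0]), (0.5, [3.0, 3.0], [4.0, 4.0])]
detectors = {"clean": [(1.0, [0.0, 0.0], [1.0, 1.0])],
             "noisy": [(1.0, [3.0, 3.0], [1.0, 1.0])]}
recording = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]
outlier = [[20.0, -20.0], [21.0, -19.0]]

accepted = detect(recording, detectors, ubm)  # only the "clean" detector fires
rejected = detect(outlier, detectors, ubm)    # every detector rejects the outlier
```

An empty result corresponds to the case in which all models reject the observation, which is how outliers and unseen degradation types are flagged in this approach.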
To parametrize the recordings, we propose to use mel-frequency cepstral coefficients (MFCCs) [58], because it has been demonstrated in [18, 59] that degradation in speech signals predictably modifies the distribution of the MFCCs by changing the covariance of the features and shifting the mean to different regions in feature space, and that the amount of change is related to the degradation level.
V-A1 Experimental Setup
For training the UBM, we randomly selected 8,000 recordings from the MMPD data set. To balance the training data over the subpopulations and prevent the model from being biased towards the dominant one, we randomly divided this subset into 5 equal partitions of 1,600 samples. The recordings of the first partition were randomly contaminated by six different types of noise, namely babble, street, restaurant, office, white Gaussian and wind noises, under different SNR conditions ranging from -10 dB to 20 dB in 2 dB steps. The recordings of the second partition were filtered by 46 real room impulse responses (RIRs) of the AIR database [60], measured with a mock-up phone in different realistic indoor environments, to produce reverberant data. As an example of non-linearities in signals, the recordings of the third partition were processed randomly by either clipping, coding or clipping followed by coding. The clipping level was set to 0.3, 0.5 and 0.7. We used 9.6 kbps and 16 kbps code-excited linear prediction (CELP) codecs [61]. To consider the combination of degradations in signals, the recordings of the fourth partition were randomly filtered by 46 different real RIRs and added to the noises typically present in indoor environments, namely babble, restaurant and office noise at 0 dB, 5 dB and 10 dB. The recordings of the last partition were used without any processing. The last subset also contains some outliers which do not contain relevant information for PD detection.
For adaptation of the degradation-dependent models, a subset of 800 good-quality recordings of PD patients and healthy speakers, balanced over both genders, was selected from the MMPD data set. From this subset, 200 recordings were corrupted by babble, restaurant, street and office noises under different SNR conditions ranging from -5 dB to 10 dB in 5 dB steps. Another subset of 200 recordings was filtered by 16 real RIRs from the AIR database. A subset of 200 recordings was also chosen to represent nonlinear distortions in signals by processing them in the same way the UBM data were distorted. The remaining 200 recordings were kept unchanged to represent the clean samples.
Using a Hamming window, recordings were segmented into frames of 30 ms with 10 ms overlap. For each frame of a signal, 12 MFCCs together with the log energy are calculated along with delta and double-delta coefficients. They are concatenated to form a 39-dimensional feature vector.
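The framing and feature stacking can be sketched as follows. The sketch assumes a 16 kHz sampling rate, a regression-based delta computation over two neighbouring frames on each side, and edge padding by frame repetition; none of these are specified in the text. The MFCC computation itself is omitted, and a placeholder 13-dimensional static feature (12 MFCCs plus log energy) is used to show how the 39-dimensional vector is assembled.

```python
import math

def frame_signal(signal, fs, frame_ms=30.0, overlap_ms=10.0):
    """Split `signal` into Hamming-windowed frames (30 ms long, 10 ms overlap, 20 ms hop)."""
    size = int(round(fs * frame_ms / 1000.0))
    hop = size - int(round(fs * overlap_ms / 1000.0))
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)) for n in range(size)]
    return [[w * s for w, s in zip(window, signal[start:start + size])]
            for start in range(0, len(signal) - size + 1, hop)]

def deltas(feats, width=2):
    """Regression-based delta coefficients; edge frames are repeated as padding."""
    norm = 2.0 * sum(n * n for n in range(1, width + 1))
    last = len(feats) - 1
    out = []
    for t in range(len(feats)):
        row = []
        for d in range(len(feats[0])):
            acc = sum(n * (feats[min(t + n, last)][d] - feats[max(t - n, 0)][d])
                      for n in range(1, width + 1))
            row.append(acc / norm)
        out.append(row)
    return out

def stack_39(static):
    """Concatenate static, delta and double-delta features frame by frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

fs = 16000  # assumed sampling rate
signal = [math.sin(2 * math.pi * 150 * t / fs) for t in range(fs)]  # 1 s toy signal
frames = frame_signal(signal, fs)

static = [[float(t)] * 13 for t in range(10)]  # placeholder for 12 MFCCs + log energy
features = stack_39(static)
```

With these assumptions a one-second signal yields 49 frames of 480 samples, and each row of `features` has 39 entries.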
V-A2 Results
To evaluate the proposed approach in identifying degradations in data not observed during the training phase, we used 10-fold cross validation with 10 iterations. For each experiment, we extended the test subset by adding 20 good-quality outlier recordings, containing sounds irrelevant for PD detection and randomly selected from the MMPD data set, to show whether the detectors could reject such outliers. Moreover, as an example of a combination of degradations in speech signals, 20 good-quality recordings were selected from the MMPD data set, contaminated by noise and reverberation in the same way as the UBM data, and appended to the test subset to investigate whether both the noise and reverberation detectors could identify these recordings.
Fig. 5 shows the performance of the detectors in terms of AUC, along with 95% confidence intervals, as a function of the number of mixture components in GMMs. We can observe from the results that the degradations in voice signals are effectively identified when GMMs with 1024 mixtures are used. The lower performance for reverberation detection model is mainly due to misdetection of some of the recordings in which noise and reverberation coexist but the noise is more dominant than the reverberation. This can also be explained by considering the analysis of vowels in the presence of different degradations [18] which shows that MFCCs of the reverberant signals are, on average, positioned closer to the MFCCs of the clean signals, while noise and distortion (clipping) shift the MFCCs farther away from the position of clean MFCCs.
V-B Frame-Level Quality Control
While many types of degradation, such as reverberation and nonlinear distortions, typically influence the entire recording, additive noise can have a short-term impact on a signal. Moreover, the test protocol can be violated for a short period of time in a remotely collected voice signal. In recording-level degradation detection, we assumed that the majority of the segments of a voice signal are influenced by some type of degradation. Likewise, if a voice sample is an outlier, the majority of its segments are assumed to contain irrelevant information for PD detection. Even though the recording-level approach is beneficial in providing global information about the quality of a signal, it does not indicate whether a degraded or an outlier signal still contains useful segments to be considered for PD detection. Identifying these segments facilitates making the most of the available data.
In this paper, we consider additive noise as an example of a short-term degradation in a signal, and develop a framework which splits a voice signal into variable duration segments in an unsupervised manner by fitting an infinite hidden Markov model (iHMM) to the frames of the recordings in the MFCC domain. Then, the degraded segments and those that are associated with the protocol adherence or violation are identified by applying a multinomial naive Bayes classifier.
A HMM represents a probability distribution over sequences of observations $\mathbf{x}_1, \ldots, \mathbf{x}_T$ by invoking a Markov chain of hidden state variables $z_1, \ldots, z_T$ where each $z_t$ is in one of the $K$ possible states [62]. The likelihood of the observation $\mathbf{x}_t$ is modeled with a distribution of $K$ mixture components as:
$$p(\mathbf{x}_t \mid z_{t-1} = i, \Theta) = \sum_{k=1}^{K} \pi_{i,k}\, p(\mathbf{x}_t \mid \theta_k) \qquad (5)$$
where $\theta_k$ are the time-independent emission parameters, $\Theta = \{\theta_1, \ldots, \theta_K\}$, $\pi_{i,k} = p(z_t = k \mid z_{t-1} = i)$, and $\boldsymbol{\pi} = [\pi_{i,k}]$ is the $K \times K$ transition matrix. We consider a HMM for clustering the frames of the signals in terms of different acoustic events. Predicting the number of states required to cover all events, such that we do not encounter unobserved events in the future, is challenging. Moreover, it is reasonable to assume that as we observe more data, different types of protocol violations and acoustic events will appear and thus the inherent number of states will have to adapt accordingly. Here, we propose to use an infinite HMM to relax the assumption of a fixed $K$ in (5), which is defined as:
$$\begin{aligned} \boldsymbol{\beta} &\sim \mathrm{GEM}(\gamma) \\ \boldsymbol{\pi}_k \mid \boldsymbol{\beta} &\sim \mathrm{DP}(\alpha, \boldsymbol{\beta}) \\ \theta_k &\sim H \\ z_t \mid z_{t-1} &\sim \mathrm{Mult}(\boldsymbol{\pi}_{z_{t-1}}) \\ \mathbf{x}_t \mid z_t &\sim F(\theta_{z_t}) \end{aligned}$$
where $\boldsymbol{\pi}_k$ are drawn from a Dirichlet process (DP) with a local concentration parameter $\alpha$, $\boldsymbol{\beta}$ is the stick-breaking representation for DPs which is drawn from the Griffiths-Engen-McCloskey (GEM) distribution with a global concentration parameter $\gamma$ [63], each $\theta_k$ is a sample drawn independently from the global base distribution $H$ over the component parameters of the HMM, and $F$ is the observation model for each state. The iHMM can have a countably infinite number of hidden states. Using the direct assignment Gibbs sampler, which marginalizes out the infinitely many transition parameters, we infer the posterior over the sequence of hidden states $z_{1:T}$ and the emission parameters $\Theta$. In each iteration of the Gibbs sampling, we first re-sample the hidden states and then the base distribution parameters. For more details about the inference, we refer to [21].
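To give some intuition for the prior, the sketch below draws a state sequence from a truncated approximation of the iHMM: the global weights are obtained by stick-breaking with the global concentration parameter (called `gamma` here), and each transition row is drawn from a finite Dirichlet distribution with parameters given by the local concentration parameter (called `alpha` here) times the global weights. This only samples from an approximate prior; it is not the direct assignment Gibbs sampler used for inference. The truncation level of 30 and the choice of drawing the first state from the global weights are assumptions made for the example.

```python
import random

def sample_gem(gamma, truncation, rng):
    """Truncated stick-breaking draw beta ~ GEM(gamma); the last entry keeps the leftover mass."""
    beta, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, gamma)
        beta.append(v * remaining)
        remaining *= 1.0 - v
    beta.append(remaining)
    return beta

def sample_dirichlet(concentration, rng):
    """Dirichlet draw obtained by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def sample_state_sequence(alpha, gamma, truncation, length, rng):
    """Sample global weights, transition rows and a hidden state sequence."""
    beta = sample_gem(gamma, truncation, rng)
    # Finite approximation of pi_k ~ DP(alpha, beta).
    pi = [sample_dirichlet([alpha * b for b in beta], rng) for _ in range(truncation)]
    states = [rng.choices(range(truncation), weights=beta)[0]]  # assumed initial-state rule
    for _ in range(length - 1):
        states.append(rng.choices(range(truncation), weights=pi[states[-1]])[0])
    return beta, pi, states

rng = random.Random(1)
beta, pi, states = sample_state_sequence(alpha=10.0, gamma=10.0, truncation=30,
                                         length=200, rng=rng)
n_used = len(set(states))  # number of distinct states actually visited
```

The number of distinct states visited, `n_used`, is typically well below the truncation level and grows with the sequence length, which is the behaviour that motivates the nonparametric prior.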
Considering an iHMM as a clustering algorithm, segments of the voice recordings with similar characteristics are clustered together under the same state indicator values. To identify the segments of the signal that are sufficiently reliable for detecting PD voice symptoms, those that need enhancement before being used for PD detection, and those which do not contain relevant information for PD detection, we propose to use the multinomial naive Bayes classifier to map the state indicators $z_1, \ldots, z_T$ to the labels $y_1, \ldots, y_T$, where $y_t$ indicates whether $\mathbf{x}_t$ adheres to the protocol, complies with the protocol but is degraded by additive noise, or violates the protocol. In the multinomial naive Bayes, we assume that the samples in different classes have different multinomial distributions, and a feature vector for the $t$th observation, $\tilde{\mathbf{x}}_t = (\tilde{x}_{t,1}, \ldots, \tilde{x}_{t,K})$, is a histogram, with $\tilde{x}_{t,k}$ being the number of times state $k$ is observed. The likelihood of the histogram of a new observation $\tilde{\mathbf{x}}$ is defined as:
$$p(\tilde{\mathbf{x}} \mid c) = \frac{\left(\sum_{k} \tilde{x}_k\right)!}{\prod_{k} \tilde{x}_k!} \prod_{k} p_{kc}^{\tilde{x}_k}$$
where $p_{kc}$ is the probability of the attribute $k$ being in class $c$, which is trained using the training data. Using the Bayes rule and the prior class probability $p(c)$, the class label for a new test observation is predicted as:
$$\hat{c} = \arg\max_{c} \left[ \log p(c) + \sum_{k} \tilde{x}_k \log p_{kc} \right]$$
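The classification step can be sketched with a small multinomial naive Bayes written from scratch. The sketch assumes that each histogram counts the state indicators within a short window, that the class-conditional probabilities are estimated with add-one (Laplace) smoothing, and that the prediction maximizes the log prior plus the count-weighted log probabilities (the multinomial coefficient is dropped since it does not depend on the class). The class names, the four-state toy data and the helper names are hypothetical.

```python
import math
from collections import Counter

def histogram(states, n_states):
    """Count how often each state indicator occurs in a window."""
    counts = Counter(states)
    return [counts.get(k, 0) for k in range(n_states)]

def train_mnb(histograms, labels, n_states, smoothing=1.0):
    """Estimate log priors and per-class log state probabilities (smoothing is an assumption)."""
    log_prior, log_p = {}, {}
    for c in sorted(set(labels)):
        rows = [h for h, y in zip(histograms, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(labels))
        counts = [smoothing + sum(r[k] for r in rows) for k in range(n_states)]
        total = sum(counts)
        log_p[c] = [math.log(v / total) for v in counts]
    return log_prior, log_p

def predict_mnb(hist, log_prior, log_p):
    """Pick the class maximizing log p(c) + sum_k x_k log p_kc."""
    scores = {c: log_prior[c] + sum(x * lp for x, lp in zip(hist, log_p[c]))
              for c in log_prior}
    return max(scores, key=scores.get)

n_states = 4
train = [([0, 0, 1], "adheres"), ([1, 0, 0], "adheres"),
         ([2, 2, 1], "noisy"), ([2, 2, 2], "noisy"),
         ([3, 3, 0], "violation"), ([3, 3, 3], "violation")]
hists = [histogram(s, n_states) for s, _ in train]
labels = [y for _, y in train]
log_prior, log_p = train_mnb(hists, labels, n_states)

pred_a = predict_mnb(histogram([2, 2, 0], n_states), log_prior, log_p)
pred_b = predict_mnb(histogram([3, 0, 3], n_states), log_prior, log_p)
```

In this toy setting a window dominated by state 2 is labeled as noisy and a window dominated by state 3 as a protocol violation.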
V-B1 Experimental Setup
To evaluate the performance of the proposed method, a subset of 100 good-quality recordings (50 PD patients and 50 healthy controls, balanced over both genders) was selected from the MMPD data set. From this subset, 50 recordings were selected and 60% of each signal was degraded by adding noise. We used babble, office, restaurant, street and wind noises, under different SNR conditions ranging from -5 dB to 10 dB in steps of 2.5 dB. In addition, 20 recordings from the MMPD data set containing several short- and long-term protocol violations were selected and added to the subset.
Using a Hamming window, recordings are segmented into frames of 30 ms with 10 ms overlap. For each frame of a signal, 12 MFCCs along with the log energy are calculated. The features of every five consecutive frames are averaged to smooth out the impact of articulation [59], and to prevent capturing very small changes in signal characteristics, which would produce many uninterpretable states. Thus, each observation represents the averaged MFCCs of 100 ms of a signal. For the iHMM, we use the conjugate normal-gamma prior over the Gaussian state parameters, set the hyper-parameters $\alpha=\gamma=10$, and run the inference for 150 iterations.
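The smoothing step amounts to a block average over the frame-level features, as in the following minimal sketch. Non-overlapping blocks are implied by the 100 ms resolution; discarding a trailing block with fewer than five frames is an assumption of the example.

```python
def average_blocks(frames, block=5):
    """Average every `block` consecutive feature vectors; a trailing partial block is dropped."""
    out = []
    for start in range(0, len(frames) - block + 1, block):
        chunk = frames[start:start + block]
        out.append([sum(col) / block for col in zip(*chunk)])
    return out

# Twelve toy 2-D feature vectors give two averaged observations.
toy_frames = [[float(i), 2.0 * i] for i in range(12)]
observations = average_blocks(toy_frames)
```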
V-B2 Results
The top plot in Fig. 6 shows a segment of 10 seconds duration selected from the data set. The segments of the signal which adhere to the test protocol and those that need enhancement are hand-labeled and shaded in green and pink, respectively. Fitting the iHMM to the data, 49 different states were discovered in this particular subset. The middle plot in Fig. 6 illustrates the generated states in different colors. To evaluate the performance of the proposed approach for data not observed during the training phase (i.e. out of sample), we used 10-fold CV and repeated the procedure 10 times. The results, presented in Table II, indicate that the proposed method can automatically identify short-term degradation and protocol violations in pathological voices with a 0.1 second resolution and high accuracy.
V-C Integrating Quality Control and Enhancement Algorithms
The proposed quality control approaches can be integrated with the enhancement algorithms for cleaning up the remotely collected signals before they are processed by a PD detection system. In this section, we evaluate how this integration can lead to improvement in PD detection accuracy.
The recording-level algorithm can be used in many different ways to provide information about the presence and type of degradation in a signal for an automatic clean-up process. For example, one possible scenario could be to convert the parallel detectors to a multi-class classifier by calculating the maximum a posteriori probability for a new observation. Then, the enhancement algorithm for which the observation has the highest degradation class probability will be applied. Nevertheless, the advantage of the proposed method over the classification-based techniques is its capability to detect outlier recordings and those degraded by a new type of degradation. Thus, an alternative approach could be to exploit the detectors to activate or bypass a set of enhancement blocks connected in series (e.g. noise reduction followed by dereverberation). This scenario not only allows enhancement of a signal degraded by more than one degradation, but also prevents outliers from being processed by the PD detection system. However, since there is no ground truth health status label for the outlier recordings, it is not possible to evaluate the performance of the PD detection system in the presence of outliers. For this reason, we considered a simple scenario in which the test subset only contains clean, noisy and reverberant recordings. Since there were no outliers in the test samples, the problem is simplified to a multi-class classification task. For the experiment, we used the same 160 test recordings we used for the enhancement experiments. From this subset, 60 recordings were randomly selected and corrupted by restaurant, office and street noises under different SNR conditions ranging from -5 dB to 7 dB in 4 dB steps. Another 60 randomly chosen recordings were filtered by 16 real RIRs from the AIR database. The enhancement algorithm used in this experiment is the DNN-SE.
The model for noise reduction was trained using the noisy recordings and the model used for dereverberation was trained using reverberant recordings. Table III shows the PD detection performance in terms of AUC for four different scenarios: (1) when no enhancement is applied to the recordings, (2) when the recordings, regardless of the presence and type of degradation, were processed randomly by either of the enhancement algorithms, (3) when recordings were enhanced by the enhancement model selected based on the estimated degradation labels, and (4) when the degraded recordings were enhanced based on the ground truth degradation labels. Comparing the results of the first and the second rows with those of the third and the fourth rows suggests that applying appropriate enhancement algorithms to the degraded signals leads to an improvement in PD detection performance, and the level of improvement is related to the accuracy of the degradation detection system.
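The alternative scenario with enhancement blocks connected in series can be sketched as below. This is a hypothetical illustration rather than the pipeline evaluated in Table III: the detector names, the threshold of 0, the block order and the stub enhancers (which only tag the recording) are assumptions, and the detector scores are taken as given.

```python
def clean_up(recording, scores, enhancers, order=("noise", "reverberation"), threshold=0.0):
    """Apply each enhancement block in `order` only if its detector fired.

    Returns (processed_recording, applied_blocks). The recording is None when every
    detector, including the "clean" one, rejects the input (outlier or unseen degradation).
    """
    accepted = {name for name, s in scores.items() if s >= threshold}
    if not accepted:
        return None, []
    applied = []
    for name in order:
        if name in accepted:
            recording = enhancers[name](recording)
            applied.append(name)
    return recording, applied

# Stub enhancers that only tag the recording (placeholders for real algorithms).
enhancers = {"noise": lambda r: r + ["denoised"],
             "reverberation": lambda r: r + ["dereverberated"]}

both = clean_up(["raw"], {"clean": -1.2, "noise": 0.8, "reverberation": 0.3}, enhancers)
clean_only = clean_up(["raw"], {"clean": 1.5, "noise": -0.4, "reverberation": -0.9}, enhancers)
outlier = clean_up(["raw"], {"clean": -3.0, "noise": -2.1, "reverberation": -4.0}, enhancers)
```

A recording accepted by both the noise and reverberation detectors passes through both blocks, a clean recording bypasses them, and a recording rejected by all detectors is withheld from the PD detection system.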
In the next experiment, we investigate how the proposed frame-level quality control method can improve the performance of PD detection. To this aim, we randomly added babble, restaurant, office and street noises to all 160 test recordings at different SNRs ranging from -5 dB to 10 dB in 5 dB steps. However, instead of adding noise to the entire signal, we randomly corrupted 60% of the frames of each signal. The enhancement algorithm used in this experiment is the Kalman-CB. In Table IV, we compare the PD detection performance for four different scenarios: (1) when no enhancement is applied to the recordings, (2) when the entire signals are enhanced, (3) when the signals are enhanced based on the predicted labels, and (4) when the signals are enhanced based on the ground truth labels.
For the last two scenarios, only the segments of the signals identified/labeled as degraded were enhanced. Moreover, we dropped the features of the frames identified as protocol violations. Comparing the result of the second scenario with those of the last two scenarios, we can observe the superiority of integrating the proposed frame-level quality control with the enhancement algorithm in dealing with short-term degradation and protocol violations in recordings.
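The segment-wise processing used in the last two scenarios can be summarized by the following sketch, in which segments labeled as degraded are enhanced, segments labeled as protocol violations are dropped, and the remaining segments are passed through unchanged. The label strings and the stub `enhance` function are hypothetical placeholders for the classifier output and the enhancement algorithm.

```python
def process_segments(segments, labels, enhance):
    """Enhance degraded segments, drop protocol violations, keep the rest unchanged."""
    kept = []
    for seg, label in zip(segments, labels):
        if label == "violation":
            continue
        kept.append(enhance(seg) if label == "degraded" else seg)
    return kept

segments = [[1.0, 2.0], [3.0], [4.0, 5.0]]
labels = ["adheres", "degraded", "violation"]
processed = process_segments(segments, labels, enhance=lambda seg: [0.5 * v for v in seg])
```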
VI Conclusion
Additive noise, reverberation and nonlinear distortion are three types of degradation typically encountered during remote voice analysis which cause an acoustic mismatch between training and operating conditions. In this paper, we investigated the impact of these degradations on the performance of a PD detection system. Then, given that the specific degradation is known, we explored the effectiveness of a variety of state-of-the-art enhancement algorithms in reducing this mismatch and, consequently, in improving the PD detection performance. We showed how applying appropriate enhancement algorithms can effectively improve the PD detection accuracy. To inform the choice of enhancement method, we proposed two quality control techniques operating at recording and frame level. The recording-level approach provides information about the presence and type of degradation in voice signals. The frame-level algorithm, on the other hand, identifies the short-term degradations and protocol violations in voice recordings. Experimental results showed the effectiveness of the quality control approaches in choosing appropriate signal enhancement algorithms, which resulted in improvement in the PD detection accuracy.
This study has important implications that extend well beyond the PD detection system. It can be considered as a step towards the design of robust speech-based applications capable of operating in a variety of acoustic environments. For example, since the proposed quality control approaches are not limited to specific speech types, they can be used as a pre-processing step for many end-to-end speech-based systems, such as automatic speech recognition, to make them more robust against different acoustic conditions. They might also be utilized to automatically control the quality of recordings in large-scale speech data sets. Moreover, these approaches have the potential to be used for other sensor modalities to identify short- and long-term degradations and abnormalities which can help to choose an adequate action.
References
- [1] L. S. Ishihara, A. Cheesbrough, C. Brayne, and A. Schrag, “Estimated life expectancy of Parkinson’s patients compared with the UK population,” Journal of Neurology, Neurosurgery & Psychiatry, vol. 78, pp. 1304–1309, 2007.
- [2] A. K. Ho, R. Iansek, C. Marigliani, J. L. Bradshaw, and S. Gates, “Speech impairment in a large sample of patients with Parkinson’s disease,” Behavioural Neurology, vol. 11, no. 3, pp. 131–137, 1998.
- [3] I. Eliasova, J. Mekyska, M. Kostalova, R. Marecek, Z. Smekal, and I. Rektorova, “Acoustic evaluation of short-term effects of repetitive transcranial magnetic stimulation on motor aspects of speech in Parkinson’s disease,” Journal of Neural Transmission, vol. 120, no. 4, pp. 597–605, 2013.
- [4] A. Tsanas, M. A. Little, P. E. McSharry, J. Spielman, and L. O. Ramig, “Novel speech signal processing algorithms for high-accuracy classification of Parkinson’s disease,” IEEE Transactions on Biomedical Engineering, vol. 59, pp. 1264–1271, 2012.
- [5] A. Zhan, M. A. Little, D. A. Harris, S. O. Abiola, E. R. Dorsey, S. Saria, and A. Terzis, “High frequency remote monitoring of Parkinson’s disease via smartphone: platform overview and medication response detection,” arXiv preprint arXiv:1601.00960, pp. 1–12, 2016.
- [6] D. Gil and M. Johnson, “Diagnosing Parkinson by using artificial neural networks and support vector machines,” Global Journal of Computer Science and Technology, pp. 63–71, 2009.
- [7] J. Rusz, J. Hlavnička, T. Tykalová, M. Novotný, P. Dušek, K. Šonka, and E. Ružička, “Smartphone allows capture of speech abnormalities associated with high risk of developing Parkinson’s disease,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 26, no. 8, pp. 1495–1507, 2018.
- [8] J. Fan, F. Han, and H. Liu, “Challenges of big data analysis,” National Science Review, vol. 1, no. 2, pp. 293–314, 2014.
