A modulation property of time-frequency derivatives of filtered phase   and its application to aperiodicity and fo estimation

Hideki Kawahara; Ken-Ichi Sakakibara; Masanori Morise; Hideki Banno,; Tomoki Toda

arXiv:1706.02964·eess.AS·July 6, 2018

A modulation property of time-frequency derivatives of filtered phase and its application to aperiodicity and fo estimation

Hideki Kawahara, Ken-Ichi Sakakibara, Masanori Morise, Hideki Banno,, Tomoki Toda

PDF

TL;DR

This paper presents a linear SNR estimator for speech aperiodicity that estimates background noise levels indirectly, applicable across various window functions, and can replace existing periodicity detection modules in vocoders.

Contribution

It introduces a novel, simple, and linear SNR estimator based on time-frequency derivatives that reliably estimates aperiodicity without direct noise extraction.

Findings

01

Effective SNR estimation from 0dB to 80dB without calibration

02

Applicable to various window functions with low sidelobe levels

03

Can replace existing aperiodicity estimation modules in vocoders

Abstract

We introduce a simple and linear SNR (strictly speaking, periodic to random power ratio) estimator (0dB to 80dB without additional calibration/linearization) for providing reliable descriptions of aperiodicity in speech corpus. The main idea of this method is to estimate the background random noise level without directly extracting the background noise. The proposed method is applicable to a wide variety of time windowing functions with very low sidelobe levels. The estimate combines the frequency derivative and the time-frequency derivative of the mapping from filter center frequency to the output instantaneous frequency. This procedure can replace the periodicity detection and aperiodicity estimation subsystems of recently introduced open source vocoder, YANG vocoder. Source code of MATLAB implementation of this method will also be open sourced.

Tables2

Table 1. Table 1: Tested windowing functions. Windows 1 to 5 are cosine serie [ 33 , 34 , 14 ] . Windows 7 and 8 are the best energy bounded functions [ 30 , 31 ] . The window 6 is their approximation [ 32 ] .

	maximum	asymptotic	nominal
	sidelobe level	decay rate	length
Type	(dB)	(dB/oct)	(re. $1 / f_{o}$ )
1 Hanning	-31.47	18	2
2 Blackman	-58.11	18	3
3 Nuttall-11	-82.60	30	4
4 Nuttall-15	-98.17	6	4
5 six-term	-114.24	54	6
6 Kaiser	-98.32	6	4.273
7 DPSS-4	-95.47	6	4.078
8 DPSS-4.5	-108.57	6	4.567

Table 2. Table 2: Duration and band-width of windowing functions

Type	$σ_{t}$	$σ_{t} / σ_{t R}$	$σ_{f}$	$σ_{f} / σ_{f R}$
1 Hanning	0.2829	0.9800	0.2861	0.4959
2 Blackman	0.3564	1.2346	0.2216	0.3841
3 Nuttall-11	0.3828	1.3261	0.2057	0.3565
4 Nuttall-15	0.4106	1.4223	0.1915	0.3320
5 six-term	0.4444	1.5394	0.1766	0.3061
6 Kaiser	0.4142	1.4349	0.1898	0.3289
7 DPSS-4	0.4072	1.4105	0.1931	0.3348
8 DPSS-4.5	0.4299	1.4892	0.1827	0.3167

Equations21

ω_{i} (t)

ω_{i} (t)

ω_{i} (t, ω)

ω_{i} (t, ω)

x_{d} (t, ω)

x_{d} (t, ω)

ω_{i s} (t, ω)

ω_{i s} (t, ω)

w (t)

w (t)

{h_{k}}_{k = 0}^{5}

0.0726831633, 0.0125124215, 0.0007833203},

C_{Δ ω}

C_{Δ ω}

C_{Δ ω t}

C_{Δ ω t}

η (t, ω)

η (t, ω)

σ_{t}^{2}

σ_{t}^{2}

σ_{f}^{2}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

A modulation property of time-frequency derivatives of filtered phase and

its application to aperiodicity and $f_{\rm o}$ estimation††thanks: The main body of this article is accepted for publication in Interspeech2017. This article has supplemental materials for details which were dropped due to limited space.

Abstract

We introduce a simple and linear SNR (strictly speaking, periodic to random power ratio) estimator (0 dB to 80 dB without additional calibration/linearization) for providing reliable descriptions of aperiodicity in speech corpus. The main idea of this method is to estimate the background random noise level without directly extracting the background noise. The proposed method is applicable to a wide variety of time windowing functions with very low sidelobe levels. The estimate combines the frequency derivative and the time-frequency derivative of the mapping from filter center frequency to the output instantaneous frequency. This procedure can replace the periodicity detection and aperiodicity estimation subsystems of recently introduced open source vocoder, YANG vocoder. Source code of MATLAB implementation of this method will also be open sourced.

Index Terms: speech analysis, fundamental frequency, aperiodicity, instantaneous frequency, group delay

1 Introduction

Aperiodic components in speech sound play important roles in speech communication (normal, expressive, extreme and so on) and singing expression[1, 2, 3, 4]. They contribute to the synthesized speech quality and intelligibility[5, 6, 7, 8]. In spite of their importance, reliable estimation of aperiodic components has been a challenging topic[9, 10, 11, 12, 13]. The major difficulty is the fact that such aperiodic components in speech are generally much weaker than the periodic components. We propose to use deviations of the prominent periodic components to estimate the level of the aperiodic components[10], instead of directly extracting and measuring them.

This article is organized as follows. First, we briefly review aperiodicity estimation methods and their underlying principles. Second, we intuitively introduce the underlying idea of the proposed method. Third, we introduce and discuss on each component of the proposed method. They are, stabilization of instantaneous frequency estimation, modulation behavior of mapping from the filter center frequency to the output instantaneous frequency, and effects of the maximum sidelobe level and the decaying behavior of sidelobes. Then, we introduce and test several implementation models and their performance. Finally, we discuss the issues of applying the proposed method to speech analysis / synthesis systems.

The main contribution of this paper is a simple SNR estimator based on the above-mentioned idea. The estimator uses a six-term cosine series function, which was proposed to implement the antialiased Fujisaki-Ljungqvist model[14]. This function provides a linear and robust SNR estimation ranging from 0 dB to 80 dB SNR without additional linearization.

2 Background

Voiced sounds are almost periodic but are not perfectly periodic[5]. Amplitude and timing of each glottal cycle are not exactly the same each other. It also comprises random noise, owing to turbulence around constrictions of vocal tract and glottis. Among these factors, amplitude and timing deviations can be estimated using spectrum analysis and $f_{\rm o}$ ††We use “ $f_{\rm o}$ ” instead of “F0” based on the recommendation[15].

(fundamental frequency) analysis, respectively[13, 16, 17]. In this paper, we focus on the remaining difficult issue, the random noise. We also focus on its spectral shape, because our primary goal is to use them for designing excitation signals for speech synthesis and modification.

One typical approach for estimating random noise component is to extract random components using harmonic analysis or using residuals of inverse filtering[18, 19, 9, 20, 12, 13, 21]. Prediction and subtraction of waveforms one pitch-period apart, also try to measure the residual levels[22, 13, 23]. These approaches are generally sensitive to the $f_{\rm o}$ estimation error and the model parameter errors, because of the large level difference between prominent periodic component and the random noise component. In other words, removing the effects of the strong periodic component is the key issue to be solved.

We introduced an algorithm to eliminate the prominent periodic component, without estimating the frequency of the periodic component[17]. The algorithm alleviates the issue mentioned above based on a self-tuning process, which does not require prior frequency information. This merit of the algorithm is obtained at the expense of temporal resolution. In this paper, we propose an alternative approach, which does not estimate the random components directly and does not need to sacrifice the temporal resolution. It starts from revisiting our legacy-STRAIGHT[24] and the fixed-point-based $f_{\rm o}$ extractor[10].

3 Phase-deviation-based SNR estimation

The instantaneous frequency of a sinusoidal component, which is separated by using a band-pass filter, is modulated by other components contained in the same filter output. The relative background noise level to the prominent periodic component can be estimated by measuring the magnitude of this modulation. This idea was implied in our fixed-point-based $f_{\rm o}$ extractor[10] but was not investigated further. We revisit this idea in this paper based on detailed analyses of the filtered signal phase behavior and the windowing function.

We focus on phase derivatives, instead of using the phase directly. Derivatives of the phase of a filtered signal are easier to interpret than phase itself. The time derivative of the phase provides the instantaneous frequency and the frequency derivative (with negative sign) provides the group delay. The power-spectrum-weighted average of each representation provides centroid of the frequency and the temporal position, respectively[25, 26, 27]. These are better understood by deriving Flanagan’s instantaneous frequency equation[28] and similarly derived group delay equation[29].

When a band-pass filter comprises one prominent sinusoidal component and a low-level interfering sinusoid, the instantaneous frequency of the filter output varies at the rate of their frequency difference. The magnitude of the variation is proportional to the ratio of their amplitude. This ratio depends on the relative location of the filter center frequency and the components. In other words, the magnitude of the frequency derivative of the mapping from the filter center frequency to the output instantaneous frequency provides the estimate of the ratio of their amplitude. For a sinusoidally varying signal, the signal and its time derivative are orthogonal to each other and the magnitude is represented as the RMS (root mean square) value with proper scaling. A random noise is a collection of independent identically distributed frequency components, the RMS value of the frequency derivative and the time-frequency derivative of the above-mentioned mapping provides the ratio of the prominent sinusoidal component and the background random component. This is an intuitive introduction to the proposed method.

For the proposed method to work properly, the following issues to be solved.

Removal of singularities

Cancellation of neighboring harmonic components yields singularities of instantaneous frequency. Effects of background noise are significantly magnified in the vicinity of these singularities and are harmful for reliable estimation.

Rejection of leakage from outside

Voiced signal consists of harmonic components. The target band-pass filter has to isolate one harmonic component from the others. The other harmonic components have to be sufficiently attenuated.

Scaling of derivatives

For minimizing the effect of the leakage and statistical fluctuations, the signal and the time derivative of the signal have to be properly scaled.

Design and calibration

The estimator has performance governing system parameters. The estimator design needs selecting and tuning these parameters, followed by calibration using known test signals.

The following sections address and solve each problem.

3.1 Removal of singularities

To remove singularities, revisit the derivation of the instantaneous frequency. Assume that the impulse response of a band-pass filter is complex valued and the filtered output $x(t)=|x(t)|e^{j\theta(t)}$ is also complex valued. The imaginary part of the logarithmic conversion of $x(t)$ is the phase function $\theta(t)$ . The time derivative of it yields the Flanagan’s phase equation[28]. By definition, it is the instantaneous (angular) frequency $\omega_{i}(t)$ :

[TABLE]

where the denominator is zero yields singularity.

Let us define the filter impulse response $h(t,\omega)$ by using a symmetric non-negative windowing function $w(t)$ for its envelope and using a complex exponential $e^{j\omega t}$ for the carrier signal. Therefore, the filter output is a function of $t$ and the carrier frequency $\omega$ and is represented using $x(t,\omega)$ . The instantaneous frequency of this filter output $\omega_{i}(t,\omega)$ is represented as follows:

[TABLE]

where $x_{d}(t,\omega)$ represents the time derivative of $x(t,\omega)$ . Because the filter output $x(t,\omega)$ is the convolution of the input signal $s(t)$ and the impulse response $h(t,\omega)$ , the time derivative of $x(t,\omega)$ is represented as follows:

[TABLE]

where “ $\ast$ ” represents convolution and $h_{d}(t,\omega)$ represents the time derivative of $h(t,\omega)$ . Equations 2 and 3 provide a procedure for calculating instantaneous frequency without introducing phase unwrapping and numerical differentiation.

Equation 2 indicates that making denominator always positive removes singularities. Specifically, using $|x(t,\omega)|^{2}$ for the weight and using a non-negative smoothing kernel $w_{F}(\omega)$ to calculate the following weighted average $\omega_{is}(t,\omega)$ removes singularities[25, 26].

[TABLE]

where the kernel $w_{F}(\omega)$ is bounded inside $(-\omega_{w},\omega_{w})$ .

3.2 Rejection of leakage from outside

Because speech signals are time varying, the observation window has to be bounded in time. This finite bounding results in infinite spread in the frequency domain. The prolate spheroidal wave function provides the best time-frequency concentration among time bounded functions[30, 31]. The Kaiser window[32] is an engineering approximation of this best function.

The frequency domain representation of bounded functions has the main lobe and the sidelobes. The normalized frequency bandwidth characterizes the main lobe. The maximum level and the decay rate characterize the sidelobes. These indices can be designed and have to be tested in terms of their effects on the SNR estimation performance. Cosine series windowing functions[33, 34, 14] are useful for testing trade-offs between these design indices.

We proposed a six-term cosine window[14] for antialiased Fujisaki-Ljungqvist model and it turned out that the proposed six-term window also is the best function for SNR estimation. It has strong attenuation (-114.24 dB) and steep decay rate (54 dB/oct) of sidelobes. The form of the window $w(t)$ and the coefficients are as follows:††The coefficients are rounded to ten digits to the right of the decimal point. The best coefficients were selected from rounded numbers. Derivation details are given in our paper[14].

[TABLE]

where $t_{w}$ represents the half length of the windowing function and for $|t|\geq t_{w}$ , the value of $w(t)$ is zero.

3.3 Scaling of derivatives

The variance of the frequency derivative of the mapping from the filter center frequency to the output instantaneous frequency, $V_{\Delta\omega}=V\!\left[\frac{d\omega_{is}(t,\omega)}{d\omega}\right]$ , is approximately proportional to the following value:

[TABLE]

where $W(\omega)$ represents the frequency response of the response envelope $w(t)$ . The variance of the time-frequency derivative $V_{\Delta\omega t}=V\!\left[\frac{d^{2}\omega_{is}(t,\omega)}{d\omega dt}\right]$ is also approximately proportional to:

[TABLE]

Scaling each derivative to make $V_{\Delta\omega}=V_{\Delta\omega t}$ minimizes the variance of the mixed index $\eta(t,\omega)$ defined as follows:

[TABLE]

where $C_{eq}\propto C_{\Delta\omega}/C_{\Delta\omega t}$ represents the scaling coefficient.

3.4 Design and calibration

††MATLAB scripts and functions used here are online-available.

http://www.sys.wakayama-u.ac.jp/%7ekawahara/IS2017AP/

We conducted a set of simulations to set design parameters, the windowing function, the window length, the smoothing kernel and the size, and the scaling coefficients. The first test is for trade-offs between the maximum sidelobe level and the decay rate of sidelobes. Table 1 shows the tested windowing functions. The items named “DPSS-4” and “DPSS-4.5” represent the windows based on the prolate spheroidal wave function, which is implemented as a MATLAB function dpss. The table lists the characterizing indices of each windowing function. The length (nominal length) of each window was designed so as to locate the first zero point at $f_{\rm o}$ .

The test signal was a 100 Hz pulse train plus Gaussian white noise with given SNR. The sampling frequency was set 44,100 Hz. The length of the FFT buffer was 32,768 samples. We tested at all possible relative window center locations in each pitch cycle. We also tested at all harmonic frequencies up to the Nyquist frequency and yielded 96,579 estimates for each test condition. Each window was scaled 1.5 times of its nominal length and the width of the frequency smoother $w_{F}(\omega)$ ††We used a raised cosine for $w_{F}(\omega)$ in this implementation. Other smoothing functions are being tested. was set a value from 0.25 to 0.35, based on preliminary tests. Each estimator was calibrated using the median of the distribution of the mixed index $\eta(t,\omega)$ .

Figure 1 shows the relation between the given SNR and the median of estimated SNR using the windows in Table 1. The results suggest that the most contributing factor on the estimation accuracy is the maximum sidelobe level. The proposed six-term cosine series shows the best linearity from 0 dB to 80 dB SNR range. Other windows show saturation in the estimated SNR.

Figure 3 and 3 illustrate the behavior of outliers of the estimates. Figure 3 shows the distance between 0.5% and 99.5% points of each cumulative distribution. It illustrates that the windows with 6 dB/oct sidelobe decay rate have extreme outliers. The smaller width of distribution for Hanning and Nuttallwin-11 is caused by the saturation of the estimates and does not indicate that the windows perform better.

Figures 3 shows that Blackman window suffers from distorted distribution for SNR higher than 40 dB. In addition, the estimates by Blackman window saturate from 30 dB SNR. These results indicate that the proposed six-term cosine series window provides the best estimates. The second best is Nuttallwin-11, because saturation from 50 dB is practically acceptable and it has shorter window length (nominal length: 4) than the six-term window (nominal length: 6).

4 Application to real speech analysis

The proposed procedure, when it is implemented using FFT, works properly for known constant $f_{\rm o}$ signals. However, $f_{\rm o}$ of real speech varies almost all the time. To make apparent $f_{\rm o}$ of such signal constant, we apply an adaptive time axis warping using the fundamental frequency trajectory $f_{\rm o}(t)$ of the signal and the target constant frequency $f_{\rm tgt}$ . The ratio $f_{\rm o}(t)/f_{\rm tgt}$ provides the necessary stretching factor of the time axis. This adaptive time warping also removes side-band components, which are the results of frequency modulation of harmonic components[35, 10, 16]. Algorithm 1 shows the pseudo code of this procedure, where $K$ represents the number of frames and $N(k)$ represents the number of harmonic components at the $k$ -th frame. The coefficient $C_{0}$ represents the calibration constant, which depends on the used window function.

4.1 Example and issues for excitation source design

Figure 4 shows an example SNR map of a Japanese vowel sequence /aiueo/, spoken by a male speaker. The signal was sampled at 22,050 Hz with 16 bit resolution. The SNR in dB value is clipped at 40 dB.

The estimated SNR $SNR(t,\omega)$ is sufficient to describe the signal characteristics. It is useful for generating annotation of physical attribute to speech corpus, for example. However, it is not enough to design an excitation source signal for synthesizing speech. There are four factors to be considered; non-speech background noise, the discrepancy between $f_{\rm o}(t)$ movement and movement of resonance frequencies of the vocal tract, and the auditory filter shape.

The background noise dominates when/where the power of the speech signal is weak. Figure 4 has a dark horizontal curved trace around 4,500 Hz from 0.2 s to 0.6 s. The trace roughly corresponds to the deep spectral dip, owing to pyriform fossa[36]. This dip of speech power may contribute to this dark trace. The background noise also introduces bias in the estimated SNR. This effect has to be compensated before determining the noise level in the excitation source signal.

The discrepancy between $f_{\rm o}(t)$ movement and movement of resonance frequencies of the vocal tract makes the estimated SNR increase at resonance frequencies. It is because the resonance frequencies on the warped time axis move and deviate from precise repetition. These resonances have to be equalized to alleviate this effect before estimating SNR.

The initial stage of the human auditory system is approximately a constant-Q filter bank. Each filter bandwidth is proportional to its center frequency. The SNR map has to be smoothed according to this frequency resolution for designing the excitation source signal. This also reduces the variance of estimates.

In addition to these, non-linear interactions of the vocal tract and the vocal fold vibration, with its temporal fluctuation[37] may introduce apparent aperiodicity around resonant frequencies. Detailed analysis and tuning of the proposed method to the target application, specifically speech synthesis, are issues which need further research.

5 Discussion

The proposed mixed index $\eta(t,\omega)$ can also be implemented using wavelet analysis, similar to YANG vocoder[17]. The SNR estimator based on wavelet analysis can replace the periodicity detector in the first stage of YANG vocoder and can provide the probability map of $f_{\rm o}$ location. This way, the proposed method is applicable to the initial stage of $f_{\rm o}$ estimation. The FFT-based implementation is also applicable to the refinement stage of $f_{\rm o}$ estimation of YANG vocoder.

We did not discuss the temporal distribution of the random component within one pitch period, but it is an important issue[11, 16]. It has to be properly estimated because the temporal location of a noise burst has a significant impact on the perceived noise level[38]. A weighted average of group delay with minimum phase compensation[29] can be applicable to this issue by revisiting derivation of the group delay equation and windowing function design. This is the next research target of aperiodicity analysis.

6 Conclusion

We propose a simple and linear SNR (periodic to noise component ratio) estimator, which is operational spanning from 0 dB to 80 dB SNR range, without using additional linearization. We introduced a six-term cosine series window, which has very low maximum sodelobe level (-114.24 dB) and steep decay rate (54 dB/oct), for implementing the estimator. These will serve as a dependable infrastructure for speech analysis. We are planning to use this estimator for updating our new open source vocoder (YANG vocoder).

7 Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers JP15H03207, JP15H02726 and JP16K12464.

Appendix A Windowing functions

Figure 5 shows the temporal shape and the frequency gain characteristics of used windowing functions. The upper plot shows the shape and the bottom plot shows the gain. The nominal length of each windowing function shown in Table 1 is designed to make the first zero to locate at one on the normalized frequency.

However, the nominal length is not relevant for representing windowing functions. It is relevant to represent their duration and bandwidth in terms of 2nd order moment, $\sigma_{t}$ and $\sigma_{f}$ respectively.

[TABLE]

where the temporal representation $|w(t)|^{2}$ and the frequency representation $|W(f)|^{2}$ are normalized for their integration to yield one.

Table 2 shows the duration and the bandwidth of each windowing function, where $\sigma_{tR}$ represents the duration of a rectangular window of the unit length and $\sigma_{fR}$ represents the bandwidth of a rectangular low-pass filter of unit cutoff frequency.

Appendix B Distribution of estimated SNR

This section provides detailed simulation results which were used to generate Fig. 1, Fig. 3 and Fig. 3. The SNR of the test signal varied from 0 dB to 80 dB in 10 dB steps.

Figure 6 shows the simulation results using the proposed six-term cosine series window. The horizontal axis represents the estimated SNR and the vertical axis represents the probability of the estimated SNR $SNR_{\mathrm{E}}$ yields lower value than the horizontal axis value $SNR_{\mathrm{T}h}$ . In short, the vertical axis represents $P_{r}(SNR_{\mathrm{E}}\leq SNR_{\mathrm{T}h})$ , where $P_{r}(Q)$ represents the probability of logical predicate $Q$ to be true.

Figure 7 shows results using Hanning and Blackman windowing functions. Strong leakage from sidelobes made distribution for higher SNR signals saturate.

Figure 8 shows reults using windows with more attenuated sidelobe levels. The highest sidelobe level of the 15-th item is lower than the 11-th item. For the 11-th item, saturation of the distribution occurs for 60 dB or higher SNR. For the 15-th item, strong distortion of the distribution occurs for 60 dB or higher SNR.

Figure 9 shows the results using theoretically the best localized function, prolate spheroidal wave function and its approximation, Kaiser window.

Bibliography38

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] H. Fujisaki, “Prosody, Models, and Spontaneous Speech,” in Computing Prosody . New York, NY: Springer US, 1997, pp. 27–42.
2[2] K.-I. Sakakibara, H. Imagawa, T. Kouishi, K. Kondo, E. Z. Murano, M. Kumada, and S. Niimi, “Vocal fold and false vocal fold vibrations in throat singing and synthesis of khöömei,” in ICMC , 2001.
3[3] C. Gobl and A. Ní Chasaide, “The role of voice quality in communicating emotion, mood and attitude,” Speech Communication , vol. 40, no. 1-2, pp. 189–212, 2003.
4[4] O. Fujimura, K. Honda, H. Kawahara, Y. Konparu, M. Morise, and J. C. Williams, “Noh voice quality.” Logopedics, phoniatrics, vocology , vol. 34, no. 4, pp. 157–170, 2009.
5[5] O. Fujimura, “An approximation to voice aperiodicity,” IEEE Transactions on Audio and Electroacoustics , vol. 16, no. 1, pp. 68–72, Mar 1968.
6[6] J. Makhoul, R. Viswanathan, R. Schwartz, and A. Huggins, “A mixed-source model for speech compression and synthesis,” The Journal of the Acoustical Society of America , vol. 64, no. 6, pp. 1577–1581, 1978.
7[7] D. W. Griffin and J. S. Lim, “Multiband excitation vocoder,” IEEE Transactions on Acoustics, Speech, and Signal Processing , vol. 36, no. 8, pp. 1223–1235, Aug 1988.
8[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Mixed excitation for hmm-based speech synthesis.” in INTERSPEECH , 2001, pp. 2263–2266.