A Variational EM Method for Pole-Zero Modeling of Speech with Mixed Block Sparse and Gaussian Excitation

Liming Shi; Jesper Kjær Nielsen; Jesper Rindom Jensen; Mads Græsbøll Christensen

arXiv:1706.07927·cs.SD·June 27, 2017

TL;DR

This paper introduces a novel pole-zero speech modeling approach using a variational EM algorithm to better capture spectral features and excitation characteristics, improving speech analysis accuracy.

Contribution

It proposes a combined block sparse and Gaussian excitation model with a variational EM method for enhanced speech spectral fitting and excitation reconstruction.

Findings

01. Lower spectral distortion compared to traditional methods

02. Effective reconstruction of block sparse excitation

03. Improved speech spectral modeling accuracy

Abstract

The modeling of speech can be used for speech synthesis and speech recognition. We present a speech analysis method based on pole-zero modeling of speech with mixed block sparse and Gaussian excitation. By using a pole-zero model, instead of the all-pole model, a better spectral fitting can be expected. Moreover, motivated by the block sparse glottal flow excitation during voiced speech and the white noise excitation for unvoiced speech, we model the excitation sequence as a combination of block sparse signals and white noise. A variational EM (VEM) method is proposed for estimating the posterior PDFs of the block sparse residuals and point estimates of modelling parameters within a sparse Bayesian learning framework. Compared to conventional pole-zero and all-pole based methods, experimental results show that the proposed method has lower spectral distortion and good performance in…

Tables

Table I: The spectral distortion

F0	200	250	300	350	400
2-norm LP	1.79	2.14	2.12	2.53	2.13
TS-LS-PZ	2.41	4.77	1.88	1.46	2.86
1-norm LP	2.43	3.15	3.60	3.29	4.29
EM-LP	5.62	6.68	4.68	3.96	4.83
VEM-PZ, D=1	4.50	7.14	2.29	1.54	2.31
VEM-PZ, D=5	1.55	4.47	0.69	2.01	4.50
VEM-PZ, D=7	2.08	4.07	2.18	1.41	1.29
VEM-PZ, D=8	0.77	5.56	2.52	4.86	0.53

Equations

$$y(n)=s(n)+u(n), \tag{1}$$

$$s(n)=-\sum_{k=1}^{K}a_{k}s(n-k)+\sum_{l=0}^{L}b_{l}e(n-l)+m(n), \tag{2}$$

$$\mathbf{y}=\mathbf{s}+\mathbf{u}, \tag{3}$$

$$\mathbf{A}\mathbf{s}=\mathbf{B}\mathbf{e}+\mathbf{m}, \tag{4}$$

$$\mathbf{A}\mathbf{y}=\mathbf{B}\mathbf{e}+\mathbf{m}+\mathbf{A}\mathbf{u}. \tag{5}$$

$$\min_{\mathbf{e}}\|\mathbf{e}\|_{1}\quad\mathrm{s.t.}\quad\|\mathbf{y}-\mathbf{A}^{-1}\mathbf{B}\mathbf{e}\|_{2}^{2}\leq C. \tag{6}$$

$$\begin{aligned}\mathbf{A}\mathbf{y}&=\mathbf{B}\mathbf{e}+\mathbf{m},\quad\mathbf{m}\sim\mathcal{N}(\mathbf{0},\gamma_{m}^{-1}\mathbf{I}_{N}),\\ \mathbf{e}&\sim\mathcal{N}(\mathbf{0},\mathbf{\Gamma}_{e}^{-1}),\quad\gamma_{m}\sim\Gamma(c,d),\quad\bm{\alpha}\sim\prod_{o=1}^{O}\Gamma(\alpha_{o};e,f),\end{aligned} \tag{7}$$

$$\begin{aligned}p(\mathbf{y},\mathbf{e},\bm{\alpha},\gamma_{m})&=p(\mathbf{y}|\mathbf{e},\gamma_{m})\,p(\mathbf{e}|\bm{\alpha})\,p(\bm{\alpha})\,p(\gamma_{m})\\ &=\mathcal{N}(\mathbf{A}\mathbf{y}|\mathbf{B}\mathbf{e},\gamma_{m}^{-1}\mathbf{I}_{N})\,\mathcal{N}(\mathbf{e}|\mathbf{0},\mathbf{\Gamma}_{e}^{-1})\\ &\quad\times\Gamma(\gamma_{m};c,d)\prod_{o=1}^{O}\Gamma(\alpha_{o};e,f),\end{aligned} \tag{8}$$

$$q(\mathbf{e},\bm{\alpha},\gamma_{m})=q(\mathbf{e})\,q(\gamma_{m})\prod_{o=1}^{O}q(\alpha_{o}), \tag{9}$$

$$\max_{q}\ \mathbb{E}_{q}[\log p(\mathbf{y},\mathbf{e},\bm{\alpha},\gamma_{m})]+H[q], \tag{10}$$

$$\begin{aligned}q(\mathbf{e})&\propto\exp\left(\mathbb{E}_{q(\bm{\alpha})q(\gamma_{m})}[\log p(\mathbf{y},\mathbf{e},\bm{\alpha},\gamma_{m})]\right),\\ q(\alpha_{o})&\propto\exp\left(\mathbb{E}_{q(\mathbf{e})q(\gamma_{m})\prod_{j\neq o}q(\alpha_{j})}[\log p(\mathbf{y},\mathbf{e},\bm{\alpha},\gamma_{m})]\right),\\ q(\gamma_{m})&\propto\exp\left(\mathbb{E}_{q(\mathbf{e})q(\bm{\alpha})}[\log p(\mathbf{y},\mathbf{e},\bm{\alpha},\gamma_{m})]\right).\end{aligned} \tag{11}$$

$$q(\mathbf{e})=\mathcal{N}(\mathbf{e}|\bm{\tilde{\mu}},\bm{\tilde{\Sigma}}), \tag{12}$$

$$q(\alpha_{o})=\Gamma(\tilde{e}_{o},\tilde{f}_{o}), \tag{13}$$

$$q(\gamma_{m})=\Gamma(\tilde{c},\tilde{d}), \tag{14}$$

$$\mathbf{C}=\left[\begin{array}{ccc}0&\cdots&0\\ y(1)&\ddots&\vdots\\ \vdots&\ddots&0\\ \vdots&&y(1)\\ \vdots&&\vdots\\ y(N-1)&\cdots&y(N-K)\end{array}\right]_{N\times K}$$

$$\min_{\mathbf{a}}\mathbb{E}_{q(\mathbf{e})}\|\mathbf{A}\mathbf{y}-\mathbf{B}\mathbf{e}\|_{2}^{2}\iff\min_{\mathbf{a}}\|\mathbf{C}\mathbf{a}+\mathbf{y}-\mathbf{B}\bm{\tilde{\mu}}\|_{2}^{2}. \tag{15}$$

$$\mathbf{a}=(\mathbf{C}^{T}\mathbf{C})^{-1}\mathbf{C}^{T}(\mathbf{B}\bm{\tilde{\mu}}-\mathbf{y}). \tag{16}$$

$$\begin{aligned}\frac{\partial\,\mathbb{E}_{q(\mathbf{e})}\|\mathbf{A}\mathbf{y}-\mathbf{B}\mathbf{e}\|_{2}^{2}}{\partial\mathbf{b}}&=\frac{\partial\,\mathbb{E}_{q(\mathbf{e})}\|\mathbf{A}\mathbf{y}-\mathbf{e}-\mathbf{F}\mathbf{b}\|_{2}^{2}}{\partial\mathbf{b}}\\ &=2\left(\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}\mathbf{F}]\mathbf{b}-\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}]\mathbf{A}\mathbf{y}+\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}\mathbf{e}]\right)=\mathbf{0},\end{aligned} \tag{17}$$

$$\mathbf{F}=\left[\begin{array}{ccc}0&\cdots&0\\ e(1)&\ddots&\vdots\\ \vdots&\ddots&0\\ \vdots&&e(1)\\ \vdots&&\vdots\\ e(N-1)&\cdots&e(N-L)\end{array}\right]_{N\times L}$$

$$\mathbf{b}=(\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}\mathbf{F}])^{-1}(\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}]\mathbf{A}\mathbf{y}-\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}\mathbf{e}]), \tag{18}$$

$$d_{\mathrm{ceps}}=\sum_{n=-S}^{S}(c_{n}-\hat{c}_{n})^{2},$$

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Blind Source Separation Techniques

Full text

A Variational EM Method for Pole-Zero Modeling of Speech with Mixed Block Sparse and Gaussian Excitation

Liming Shi, Jesper Kjær Nielsen, Jesper Rindom Jensen and Mads Græsbøll Christensen

Audio Analysis Lab, AD:MT, Aalborg University

Emails: {ls, jkn, jrj, mgc}@create.aau.dk

This work was funded by the Danish Council for Independent Research, grant ID: DFF 4184-00056.

Abstract

The modeling of speech can be used for speech synthesis and speech recognition. We present a speech analysis method based on pole-zero modeling of speech with mixed block sparse and Gaussian excitation. By using a pole-zero model, instead of the all-pole model, a better spectral fitting can be expected. Moreover, motivated by the block sparse glottal flow excitation during voiced speech and the white noise excitation for unvoiced speech, we model the excitation sequence as a combination of block sparse signals and white noise. A variational EM (VEM) method is proposed for estimating the posterior PDFs of the block sparse residuals and point estimates of modelling parameters within a sparse Bayesian learning framework. Compared to conventional pole-zero and all-pole based methods, experimental results show that the proposed method has lower spectral distortion and good performance in reconstructing the block sparse excitation.

I Introduction

The modeling of speech has important applications in speech analysis [1], speaker verification [2], speech synthesis [3], etc. In the source-filter model, speech is produced by an excitation, a pulse train for voiced speech or white noise for unvoiced speech, that is filtered by the speech production filter (SPF), which consists of the vocal tract and lip radiation.

All-pole modeling with a least squares cost function performs well for white noise and low pitch excitation. However, for high pitch excitation, it leads to an all-pole filter with poles close to the unit circle, and the estimated spectrum has a sharper contour than desired [4, 5]. To obtain a robust linear prediction (LP), the Itakura-Saito error criterion [6], all-pole modeling with a distortionless response at the harmonic frequencies [4], regularized LP [7] and short-time energy weighted LP [8] have been proposed. Motivated by compressive sensing research, a least 1-norm criterion was proposed for voiced speech analysis [9], where sparse priors on both the excitation signals and the prediction coefficients are utilized. Fast methods and the stability of the 1-norm cost function for spectral envelope estimation were further investigated in [10, 11]. More recently, in [12], the excitation signal of speech was formulated as a combination of block sparse and white noise components to capture the block sparse or white noise excitation separately or simultaneously. An expectation-maximization (EM) algorithm was used to reconstruct the block sparse excitation within a sparse Bayesian learning (SBL) framework [13].

A problem with the all-pole model is that some sounds containing spectral zeros with voiced excitation, such as nasals or laterals, are poorly estimated by an all-pole model but are straightforward to represent with a pole-zero (PZ) model [14, 15]. The coefficients of the pole-zero model can be estimated separately [16], jointly [17] or iteratively [18]. A 2-norm minimization criterion with a Gaussian residual assumption is commonly used. Frequency domain fitting methods based on a similarity measure have also been proposed. Motivated by the logarithmic scale perception of the human auditory system, the logarithmic magnitude function minimization criterion has been proposed [19, 15]. In [19], the nonlinear logarithmic cost function is solved by transforming it into a weighted least squares problem. The Gauss-Newton and Quasi-Newton methods for solving it are further investigated in [15]. To consider both the voiced excitation and the PZ model, a speech analysis method based on the PZ model with sparse excitation in noisy conditions was presented in [20]. A least 1-norm criterion is used for the coefficient estimation, and sparse deconvolution is applied for deriving sparse residuals.

In this paper, we propose a speech analysis method based on the PZ model with mixed excitation. Using the mixed excitation and PZ modeling together, we combine the advantages of non-sparse and sparse algorithms, and obtain a better fit for both the excitation and the SPF spectrum. Using the PZ model, instead of the all-pole model, a better spectral fitting can be expected. Moreover, the mixed excitation lets us model voiced excitation, unvoiced excitation, or a mixture of the two. Additionally, block sparsity is imposed on the voiced excitation component, motivated by the quasi-periodic and temporally correlated nature of the glottal excitation [21, 12]. The posterior probability density functions (PDFs) for the sparse excitation and hyper-parameters, as well as point estimates of the PZ model parameters, are obtained using the VEM method.

II Signal models

Consider the following general speech observation model:

$$y(n)=s(n)+u(n), \tag{1}$$

where $y(n)$ is the observation signal and $u(n)$ denotes the noise. We assume that the clean speech signal $s(n)$ is produced by the PZ speech production model, i.e.,

$$s(n)=-\sum_{k=1}^{K}a_{k}s(n-k)+\sum_{l=0}^{L}b_{l}e(n-l)+m(n), \tag{2}$$

where $a_{k}$ and $b_{l}$ are the modeling coefficients of the PZ model with $b_{0}=1$ , $e\left(n\right)$ is a sparse excitation corresponding to the voiced part and $m(n)$ is the non-sparse Gaussian excitation component corresponding to the unvoiced part. Assuming $s\left(n\right)=0\ \mathrm{for}\ n\leq 0$ and considering one frame of speech signals of $N$ samples, (1) and (2) can be written in matrix forms as

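$$\mathbf{y}=\mathbf{s}+\mathbf{u}, \tag{3}$$

$$\mathbf{A}\mathbf{s}=\mathbf{B}\mathbf{e}+\mathbf{m}, \tag{4}$$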

where $\mathbf{A}$ and $\mathbf{B}$ are the $N\times{N}$ lower triangular Toeplitz matrices with $[1,a_{1},a_{2},\cdots,a_{K},0,\cdots,0]$ and $[1,b_{1},b_{2},\cdots,b_{L},0,\cdots,0]$ as the first columns, respectively. The block sparse residuals are defined as $\mathbf{e}=[e\left(1\right)\cdots e\left({N}\right)]^{T}$ , and $\mathbf{m}$ , $\mathbf{s}$ , $\mathbf{y}$ and $\mathbf{u}$ are defined similarly to $\mathbf{e}$ . When $L=0$ , $\mathbf{B}$ reduces to the identity matrix and (4) becomes the all-pole model. Combining (3) and (4), the noisy observation can be written as

$$\mathbf{A}\mathbf{y}=\mathbf{B}\mathbf{e}+\mathbf{m}+\mathbf{A}\mathbf{u}. \tag{5}$$
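The matrix form can be checked numerically. The sketch below, with made-up coefficients and excitation that are not taken from the paper, runs the difference equation (2) sample by sample and compares the result with $\mathbf{A}\mathbf{s}=\mathbf{B}\mathbf{e}+\mathbf{m}$ in the noise-free case $\mathbf{u}=\mathbf{0}$. The helper name `lower_toeplitz` is hypothetical.

```python
import numpy as np

def lower_toeplitz(coeffs, N):
    """N x N lower triangular Toeplitz matrix with first column [1, coeffs..., 0, ..., 0]."""
    T = np.eye(N)
    for k, c in enumerate(coeffs, start=1):
        T += c * np.eye(N, k=-k)  # place coefficient c on the k-th subdiagonal
    return T

# Illustrative values only (assumptions, not from the paper).
N = 32
a = np.array([-0.9, 0.5])  # pole coefficients a_1..a_K
b = np.array([0.4])        # zero coefficients b_1..b_L, with b_0 = 1 implied
rng = np.random.default_rng(0)
e = np.zeros(N)
e[[3, 4, 19, 20]] = [1.0, 0.6, 1.0, 0.6]  # two short excitation blocks
m = 0.01 * rng.standard_normal(N)         # Gaussian excitation component

# Difference equation (2) with zero initial conditions.
b_full = np.concatenate(([1.0], b))
s = np.zeros(N)
for n in range(N):
    ar = sum(a[k - 1] * s[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
    ma = sum(b_full[l] * e[n - l] for l in range(len(b_full)) if n - l >= 0)
    s[n] = -ar + ma + m[n]

A = lower_toeplitz(a, N)
B = lower_toeplitz(b, N)
residual = np.max(np.abs(A @ s - (B @ e + m)))  # should be at rounding level
```

Because $\mathbf{A}$ has a unit diagonal, its determinant is 1, which is the property used later when the likelihood is rewritten in terms of $\mathbf{A}\mathbf{y}$.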

In [20], we assumed that only the sparse excitation was present ( $\mathbf{m}=\mathbf{0}$ , but $\mathbf{u}\neq\mathbf{0}$ ). The sparse residuals and model parameters were estimated iteratively. The sparse residuals were obtained by solving

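$$\min_{\mathbf{e}}\|\mathbf{e}\|_{1}\quad\mathrm{s.t.}\quad\|\mathbf{y}-\mathbf{A}^{-1}\mathbf{B}\mathbf{e}\|_{2}^{2}\leq C, \tag{6}$$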

where $C$ is a constant proportional to the variance of the noise. The model parameters were estimated using the $l_{1}$ norm of the residuals as the cost function (see [20] for details).

III Proposed Variational EM method

We now proceed to consider the noise-free scenario but with mixed excitation ( $\mathbf{u}=\mathbf{0}$ , but $\mathbf{m}\neq\mathbf{0}$ ). We consider the pole-zero model parameters $\mathbf{a}=[a_{1},a_{2},\cdots,a_{K}]^{T}$ and $\mathbf{b}=[b_{1},b_{2},\cdots,b_{L}]^{T}$ to be deterministic but unknown. Utilizing the SBL [13] methodology, we first express the hierarchical form of the model as

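$$\begin{aligned}\mathbf{A}\mathbf{y}&=\mathbf{B}\mathbf{e}+\mathbf{m},\quad\mathbf{m}\sim\mathcal{N}(\mathbf{0},\gamma_{m}^{-1}\mathbf{I}_{N}),\\ \mathbf{e}&\sim\mathcal{N}(\mathbf{0},\mathbf{\Gamma}_{e}^{-1}),\quad\gamma_{m}\sim\Gamma(c,d),\quad\bm{\alpha}\sim\prod_{o=1}^{O}\Gamma(\alpha_{o};e,f),\end{aligned} \tag{7}$$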

where ${O}$ is the number of blocks, $\mathbf{\Gamma}_{e}=\text{diag}(\bm{\alpha})\otimes\mathbf{I}_{{D}}$, $\otimes$ is the Kronecker product, ${D}$ is the block size, ${N}={D}{O}$, $\mathcal{N}$ denotes the multivariate normal PDF and ${\Gamma}$ is the Gamma PDF. The hyperparameter $\alpha_{o}$ is the precision of the $o^{\mathrm{th}}$ block, and when it is infinite, the $o^{\mathrm{th}}$ block will be zero. Note that it is trivial to extend the proposed method to any ${D}$. Moreover, when ${D}=1$, each element in $\mathbf{e}$ is inferred independently. Here, the block sparsity model is used to take the quasi-periodic and temporally correlated nature of the voiced excitation into account. The term $\mathbf{m}$ captures the white noise excitation from unvoiced speech frames or a mixture of phonations.
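To make the block structure of the prior concrete, the following sketch builds $\mathbf{\Gamma}_{e}=\text{diag}(\bm{\alpha})\otimes\mathbf{I}_{D}$ with `np.kron` and draws one sample of $\mathbf{e}$. The precision values and block size are illustrative assumptions; very large precisions stand in for "infinite" ones, so the corresponding blocks come out numerically close to zero.

```python
import numpy as np

def block_precision(alpha, D):
    """Precision matrix diag(alpha) kron I_D: one shared precision per block of D samples."""
    return np.kron(np.diag(alpha), np.eye(D))

# Illustrative values (assumptions): O = 4 blocks of size D = 3, so N = 12.
alpha = np.array([1e8, 1.0, 1e8, 4.0])
D = 3
Gamma_e = block_precision(alpha, D)

# Draw e ~ N(0, Gamma_e^{-1}); Gamma_e is diagonal, so scale unit normals elementwise.
rng = np.random.default_rng(1)
e = rng.standard_normal(len(alpha) * D) / np.sqrt(np.diag(Gamma_e))

# Energy per block: blocks with huge precision are effectively switched off.
block_energy = (e.reshape(-1, D) ** 2).sum(axis=1)
```

All samples inside a block share one precision, so a block is either active or suppressed as a whole. With $D=1$ the same construction reduces to an independent precision per sample.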

Our objective is to obtain the posterior densities of $\mathbf{e}$ , $\gamma_{m}$ and $\bm{\alpha}$ , and point estimates of the model parameters in $\mathbf{a}$ and $\mathbf{b}$ . First, we write the complete likelihood, i.e.,

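$$\begin{aligned}p(\mathbf{y},\mathbf{e},\bm{\alpha},\gamma_{m})&=p(\mathbf{y}|\mathbf{e},\gamma_{m})\,p(\mathbf{e}|\bm{\alpha})\,p(\bm{\alpha})\,p(\gamma_{m})\\ &=\mathcal{N}(\mathbf{A}\mathbf{y}|\mathbf{B}\mathbf{e},\gamma_{m}^{-1}\mathbf{I}_{N})\,\mathcal{N}(\mathbf{e}|\mathbf{0},\mathbf{\Gamma}_{e}^{-1})\\ &\quad\times\Gamma(\gamma_{m};c,d)\prod_{o=1}^{O}\Gamma(\alpha_{o};e,f),\end{aligned} \tag{8}$$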

where we used $\mathcal{N}(\mathbf{y}|\mathbf{A}^{-1}\mathbf{B}\mathbf{e},\gamma_{m}^{-1}(\mathbf{A}^{T}\mathbf{A})^{-1})=\mathcal{N}(\mathbf{A}\mathbf{y}|\mathbf{B}\mathbf{e},\gamma_{m}^{-1}\mathbf{I}_{N})$ when $\mathrm{det}(\mathbf{A})=1$ . Instead of finding the joint posterior density $p(\mathbf{e},\bm{\alpha},\gamma_{m}|\mathbf{y})$ , which is intractable, we adopt the variational approximation [22]. Assume that $p(\mathbf{e},\bm{\alpha},\gamma_{m}|\mathbf{y})$ is approximated by the density $q(\mathbf{e},\bm{\alpha},\gamma_{m})$ , which may be fully factorized as

$$q(\mathbf{e},\bm{\alpha},\gamma_{m})=q(\mathbf{e})\,q(\gamma_{m})\prod_{o=1}^{O}q(\alpha_{o}), \tag{9}$$

where the factors are found using an EM-like algorithm [22].
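The Gaussian identity used for the likelihood term follows from a change of variables. A short derivation, written under the model assumptions stated above ($\mathbf{y}=\mathbf{A}^{-1}(\mathbf{B}\mathbf{e}+\mathbf{m})$ with $\mathbf{m}\sim\mathcal{N}(\mathbf{0},\gamma_{m}^{-1}\mathbf{I}_{N})$), is:

```latex
\begin{aligned}
\mathcal{N}\!\left(\mathbf{y}\,\middle|\,\mathbf{A}^{-1}\mathbf{B}\mathbf{e},\,\gamma_{m}^{-1}(\mathbf{A}^{T}\mathbf{A})^{-1}\right)
&= \left(\frac{\gamma_{m}}{2\pi}\right)^{N/2}\det(\mathbf{A}^{T}\mathbf{A})^{1/2}
   \exp\!\left(-\frac{\gamma_{m}}{2}(\mathbf{y}-\mathbf{A}^{-1}\mathbf{B}\mathbf{e})^{T}\mathbf{A}^{T}\mathbf{A}(\mathbf{y}-\mathbf{A}^{-1}\mathbf{B}\mathbf{e})\right)\\
&= |\det(\mathbf{A})|\left(\frac{\gamma_{m}}{2\pi}\right)^{N/2}
   \exp\!\left(-\frac{\gamma_{m}}{2}\|\mathbf{A}\mathbf{y}-\mathbf{B}\mathbf{e}\|_{2}^{2}\right)\\
&= |\det(\mathbf{A})|\;\mathcal{N}\!\left(\mathbf{A}\mathbf{y}\,\middle|\,\mathbf{B}\mathbf{e},\,\gamma_{m}^{-1}\mathbf{I}_{N}\right).
\end{aligned}
```

Since $\mathbf{A}$ is lower triangular with ones on its diagonal, $\det(\mathbf{A})=1$ and the two densities coincide, so no Jacobian term depending on $\mathbf{a}$ enters the bound.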

In the E-step of the VEM method, we fix the model parameters $\mathbf{a}$ and $\mathbf{b}$ , and re-formulate the posterior PDFs estimation problem as maximizing the variational lower bound

$$\max_{q}\ \mathbb{E}_{q}[\log p(\mathbf{y},\mathbf{e},\bm{\alpha},\gamma_{m})]+H[q], \tag{10}$$

where $q$ is the shorthand for $q(\mathbf{e},\bm{\alpha},\gamma_{m})$, $H[q]$ is defined as $H[q]=-\mathbb{E}_{q}[\log(q)]$, and $\mathbb{E}_{q(x)}[f(x)]$ denotes the expectation of $f(x)$ w.r.t. the random variable $x$ (i.e., $\mathbb{E}_{q(x)}[f(x)]=\int f(x)q(x)dx$). Substituting (8) and (9) into (10), and following the derivation from [22], we obtain

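$$\begin{aligned}q(\mathbf{e})&\propto\exp\left(\mathbb{E}_{q(\bm{\alpha})q(\gamma_{m})}[\log p(\mathbf{y},\mathbf{e},\bm{\alpha},\gamma_{m})]\right),\\ q(\alpha_{o})&\propto\exp\left(\mathbb{E}_{q(\mathbf{e})q(\gamma_{m})\prod_{j\neq o}q(\alpha_{j})}[\log p(\mathbf{y},\mathbf{e},\bm{\alpha},\gamma_{m})]\right),\\ q(\gamma_{m})&\propto\exp\left(\mathbb{E}_{q(\mathbf{e})q(\bm{\alpha})}[\log p(\mathbf{y},\mathbf{e},\bm{\alpha},\gamma_{m})]\right).\end{aligned} \tag{11}$$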

It can be seen that $q(\mathbf{e})$ in (11) is a Gaussian PDF, i.e.,

$$q(\mathbf{e})=\mathcal{N}(\mathbf{e}|\bm{\tilde{\mu}},\bm{\tilde{\Sigma}}), \tag{12}$$

where $\bm{\tilde{\Sigma}}=(\mathbb{E}[\gamma_{m}]\mathbf{B}^{T}\mathbf{B}+\mathbb{E}[\mathbf{\Gamma}_{e}])^{-1}$ and $\bm{\tilde{\mu}}=\mathbb{E}[\gamma_{m}]\bm{\tilde{\Sigma}}\mathbf{B}^{T}\mathbf{A}\mathbf{y}$. We also define the auto-correlation matrix $\mathbf{\tilde{R}}=\bm{\tilde{\Sigma}}+\bm{\tilde{\mu}}\bm{\tilde{\mu}}^{T}$. The posterior PDF of $\alpha_{o}$ in (11) is a Gamma probability density, i.e.,

$$q(\alpha_{o})=\Gamma(\tilde{e}_{o},\tilde{f}_{o}), \tag{13}$$

where $\tilde{e}_{o}=e+D/2$ , $\tilde{f}_{o}=f+\sum_{i=(o-1)D+1}^{oD}\mathbf{\tilde{R}}_{i,i}/2$ and $\mathbf{\tilde{R}}_{i,i}$ denotes the $(i,i)$ element of $\mathbf{\tilde{R}}$ . The expectation of the precision matrix is $\mathbb{E}[\mathbf{\Gamma}_{e}]=\mathrm{diag}(\tilde{e}_{1}/\tilde{f}_{1},\cdots,\tilde{e}_{O}/\tilde{f}_{O})\otimes\mathbf{I}_{{D}}$ . Similar to $\alpha_{o}$ , the posterior PDF of $\gamma_{m}$ is

$$q(\gamma_{m})=\Gamma(\tilde{c},\tilde{d}), \tag{14}$$

where $\tilde{c}=c+{N}/2$ , $\tilde{d}=d+(\mathrm{tr}(\bm{\tilde{\Sigma}}\mathbf{B}^{T}\mathbf{B})+\|\mathbf{A}\mathbf{y}-\mathbf{B}\bm{\tilde{\mu}}\|_{2}^{2})/2$ . The expectation of $\gamma_{m}$ can be expressed as $\mathbb{E}[\gamma_{m}]=\tilde{c}/\tilde{d}$ .
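The E-step updates above can be collected into a single function. This is a minimal sketch of one sweep under the stated formulas, not the authors' implementation. The function name `e_step`, the argument names (`e0`, `f0` for the Gamma hyperparameters $e$, $f$, renamed to avoid clashing with the excitation $\mathbf{e}$) and the hyperparameter value `1e-6` are assumptions made for illustration.

```python
import numpy as np

def e_step(A, B, y, alpha_mean, gamma_mean, D, c, d, e0, f0):
    """One variational E-step sweep: update q(e), then q(alpha_o) and q(gamma_m)."""
    N = len(y)
    Gamma_mean = np.kron(np.diag(alpha_mean), np.eye(D))      # E[Gamma_e]
    Sigma = np.linalg.inv(gamma_mean * B.T @ B + Gamma_mean)  # posterior covariance of e
    mu = gamma_mean * Sigma @ B.T @ (A @ y)                   # posterior mean of e
    R = Sigma + np.outer(mu, mu)                              # auto-correlation matrix

    # q(alpha_o) = Gamma(e_t, f_t): sum the diagonal of R over each block of size D.
    e_t = e0 + D / 2
    f_t = f0 + np.diag(R).reshape(-1, D).sum(axis=1) / 2

    # q(gamma_m) = Gamma(c_t, d_t).
    c_t = c + N / 2
    d_t = d + (np.trace(Sigma @ B.T @ B) + np.sum((A @ y - B @ mu) ** 2)) / 2

    return mu, Sigma, e_t / f_t, c_t / d_t  # means of alpha and gamma_m

# Tiny sanity case (assumed values): A = B = I, D = 1, unit initial precisions.
y = np.array([1.0, -2.0, 0.0, 4.0])
I4 = np.eye(4)
mu, Sigma, alpha_new, gamma_new = e_step(
    I4, I4, y, alpha_mean=np.ones(4), gamma_mean=1.0, D=1,
    c=1e-6, d=1e-6, e0=1e-6, f0=1e-6)
```

In this sanity case the posterior covariance is $0.5\,\mathbf{I}$ and the posterior mean is $\mathbf{y}/2$, which matches the closed-form expressions. For long frames, solving a linear system instead of forming the explicit inverse would be a reasonable refinement.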

In the M-step, we maximize the lower bound (10) w.r.t. the modeling parameters $\mathbf{a}$ and $\mathbf{b}$, respectively. The optimization problems can be shown to be equivalent to $\min\limits_{\mathbf{a}}\mathbb{E}_{q(\mathbf{e})}\|\mathbf{A}\mathbf{y}-\mathbf{B}\mathbf{e}\|_{2}^{2}\ \mathrm{and}\ \min\limits_{\mathbf{b}}\mathbb{E}_{q(\mathbf{e})}\|\mathbf{A}\mathbf{y}-\mathbf{B}\mathbf{e}\|_{2}^{2}$, respectively. To obtain the estimate for $\mathbf{a}$, we first note that $\mathbf{Ay}$ can be expressed as $\mathbf{Ay}=\mathbf{C}\mathbf{a}+\mathbf{y}$, where $\mathbf{C}$ is an ${N}\times{K}$ Toeplitz matrix of the form

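$$\mathbf{C}=\left[\begin{array}{ccc}0&\cdots&0\\ y(1)&\ddots&\vdots\\ \vdots&\ddots&0\\ \vdots&&y(1)\\ \vdots&&\vdots\\ y(N-1)&\cdots&y(N-K)\end{array}\right]_{N\times K}$$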

Using this expression and $q(\mathbf{e})$ obtained in the E-step, the minimization problem w.r.t. $\mathbf{a}$ can be re-formulated as

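$$\min_{\mathbf{a}}\mathbb{E}_{q(\mathbf{e})}\|\mathbf{A}\mathbf{y}-\mathbf{B}\mathbf{e}\|_{2}^{2}\iff\min_{\mathbf{a}}\|\mathbf{C}\mathbf{a}+\mathbf{y}-\mathbf{B}\bm{\tilde{\mu}}\|_{2}^{2}. \tag{15}$$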

As can be seen, (15) is a standard least squares problem with the analytical solution

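$$\mathbf{a}=(\mathbf{C}^{T}\mathbf{C})^{-1}\mathbf{C}^{T}(\mathbf{B}\bm{\tilde{\mu}}-\mathbf{y}). \tag{16}$$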
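The update of $\mathbf{a}$ is an ordinary least squares fit. The sketch below builds $\mathbf{C}$ from delayed copies of $\mathbf{y}$ and solves the problem with `np.linalg.lstsq` rather than forming $(\mathbf{C}^{T}\mathbf{C})^{-1}$ explicitly, which is a numerical choice made here and not something the paper prescribes. The helper names and the test signal are hypothetical.

```python
import numpy as np

def delay_matrix(x, P):
    """N x P matrix whose p-th column is x delayed by p samples, zero-padded at the top."""
    N = len(x)
    M = np.zeros((N, P))
    for p in range(1, P + 1):
        M[p:, p - 1] = x[:N - p]
    return M

def update_a(y, B, mu, K):
    """Least squares solution of min_a ||C a + y - B mu||^2."""
    C = delay_matrix(y, K)
    a, *_ = np.linalg.lstsq(C, B @ mu - y, rcond=None)
    return a

# Noise-free all-pole check (assumed values): with B = I and mu equal to the
# true excitation, the estimate should reproduce the true coefficients.
N = 50
a_true = np.array([-0.9, 0.5])
rng = np.random.default_rng(2)
e = rng.standard_normal(N)
A = np.eye(N) + sum(c * np.eye(N, k=-k) for k, c in enumerate(a_true, start=1))
y = np.linalg.solve(A, e)  # y = A^{-1} e
a_hat = update_a(y, np.eye(N), e, K=2)
```

The same `delay_matrix` layout (first row zero, column $p$ shifted down by $p$ samples) also describes the matrix $\mathbf{F}$ used for the zero coefficients, with the excitation in place of $\mathbf{y}$.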

We can obtain the solution of $\mathbf{b}$ , like $\mathbf{a}$ , by setting the derivative of ${\mathbb{E}_{q(\mathbf{e})}\|\mathbf{A}\mathbf{y}-\mathbf{B}\mathbf{e}\|_{2}^{2}}$ w.r.t. $\mathbf{b}$ to zero, i.e.,

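$$\begin{aligned}\frac{\partial\,\mathbb{E}_{q(\mathbf{e})}\|\mathbf{A}\mathbf{y}-\mathbf{B}\mathbf{e}\|_{2}^{2}}{\partial\mathbf{b}}&=\frac{\partial\,\mathbb{E}_{q(\mathbf{e})}\|\mathbf{A}\mathbf{y}-\mathbf{e}-\mathbf{F}\mathbf{b}\|_{2}^{2}}{\partial\mathbf{b}}\\ &=2\left(\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}\mathbf{F}]\mathbf{b}-\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}]\mathbf{A}\mathbf{y}+\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}\mathbf{e}]\right)=\mathbf{0},\end{aligned} \tag{17}$$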

where $\mathbf{F}$ is an ${N}\times{L}$ lower triangular Toeplitz matrix of the form

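$$\mathbf{F}=\left[\begin{array}{ccc}0&\cdots&0\\ e(1)&\ddots&\vdots\\ \vdots&\ddots&0\\ \vdots&&e(1)\\ \vdots&&\vdots\\ e(N-1)&\cdots&e(N-L)\end{array}\right]_{N\times L}$$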

From (17), we obtain the estimate of $\mathbf{b}$, i.e.,

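$$\mathbf{b}=(\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}\mathbf{F}])^{-1}(\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}]\mathbf{A}\mathbf{y}-\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}\mathbf{e}]), \tag{18}$$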

where $\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}\mathbf{F}]$ is an $L\times L$ symmetric matrix with the $(i,j)^{\mathrm{th}},j\geq i$ element given by $\sum_{k=1}^{{N}-j}\mathbf{\tilde{R}}_{k,k+j-i}$. The term $\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}]$ can be obtained by simply replacing the stochastic variable $e(n),1\leq n\leq{N}-1$ in $\mathbf{F}^{T}$ with the mean estimate $\tilde{\mu}(n)$ (the $n^{\mathrm{th}}$ element in $\bm{\tilde{\mu}}$). The term $\mathbb{E}_{q(\mathbf{e})}[\mathbf{F}^{T}\mathbf{e}]$ is an $L\times 1$ vector with the $l^{\mathrm{th}}$ element given by $\sum_{k=1}^{{N}-l}\mathbf{\tilde{R}}_{k,k+l}$. Note that the estimation of $\mathbf{b}$ in (18) requires the knowledge of $\mathbf{a}$ and vice versa (see (16)). This coupling is resolved by replacing them with their estimates from the previous iteration. The algorithm is initialized with $\mathbf{a}=[1,0,\cdots,0_{K}]^{T}$, $\mathbf{b}=[1,0,\cdots,0_{L}]^{T}$, $\gamma_{m}=10$ and $\alpha_{o}=1,o=1,\cdots,O$, and starts with the update of $\mathbf{e}$. We refer to the proposed variational expectation maximization pole-zero estimation algorithm as the VEM-PZ.
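The expectations needed for the $\mathbf{b}$ update reduce to sums along the superdiagonals of $\mathbf{\tilde{R}}$. The sketch below is one possible way to compute them with `np.diagonal`, using 0-based array indices for the 1-based sums in the text. Function names and the test signal are assumptions. The check uses a zero posterior covariance, in which case the expectations must equal the quantities computed from an explicit $\mathbf{F}$.

```python
import numpy as np

def delay_matrix(x, P):
    """N x P matrix whose p-th column is x delayed by p samples, zero-padded at the top."""
    N = len(x)
    M = np.zeros((N, P))
    for p in range(1, P + 1):
        M[p:, p - 1] = x[:N - p]
    return M

def update_b(A, y, mu, Sigma, L):
    """Solve E[F^T F] b = E[F^T] A y - E[F^T e] using the posterior moments of e."""
    N = len(y)
    R = Sigma + np.outer(mu, mu)
    FtF = np.zeros((L, L))
    for i in range(1, L + 1):
        for j in range(i, L + 1):
            # sum_{k=1}^{N-j} R_{k,k+j-i}: first N-j entries of the (j-i)-th superdiagonal
            val = np.diagonal(R, offset=j - i)[:N - j].sum()
            FtF[i - 1, j - 1] = FtF[j - 1, i - 1] = val
    # sum_{k=1}^{N-l} R_{k,k+l}: the full l-th superdiagonal
    Fte = np.array([np.diagonal(R, offset=l).sum() for l in range(1, L + 1)])
    F_mean = delay_matrix(mu, L)  # E[F]: e(n) replaced by its posterior mean
    return np.linalg.solve(FtF, F_mean.T @ (A @ y) - Fte)

# Noise-free all-zero check (assumed values): A = I, y = B e, exact knowledge of e.
N = 40
b_true = np.array([0.4, -0.2])
rng = np.random.default_rng(3)
e = rng.standard_normal(N)
B = np.eye(N) + sum(c * np.eye(N, k=-k) for k, c in enumerate(b_true, start=1))
y = B @ e
b_hat = update_b(np.eye(N), y, mu=e, Sigma=np.zeros((N, N)), L=2)
```

In a full iteration this function would be called with the $\bm{\tilde{\mu}}$ and $\bm{\tilde{\Sigma}}$ from the E-step and the current $\mathbf{A}$, alternating with the $\mathbf{a}$ update as described above.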

IV Results

In this section, we compare the performance of the proposed VEM-PZ, the two-stage least squares pole-zero (TS-LS-PZ) method [14], 2-norm linear prediction (2-norm LP) [1], 1-norm linear prediction (1-norm LP) [9] and expectation maximization based linear prediction (EM-LP) for mixed excitation [12] in both synthetic and real speech signal analysis scenarios.

IV-A Synthetic signal analysis

Bibliography

[1] J. Makhoul, “Linear prediction: A tutorial review,” Proc. IEEE, vol. 63, no. 4, pp. 561–580, 1975.
[2] J. Pohjalainen, C. Hanilci, T. Kinnunen, and P. Alku, “Mixture linear prediction in speaker verification under vocal effort mismatch,” IEEE Signal Process. Lett., vol. 21, no. 12, pp. 1516–1520, Dec. 2014.
[3] D. Erro, I. Sainz, E. Navas, and I. Hernaez, “Harmonics plus noise model based vocoder for statistical parametric speech synthesis,” IEEE J. Sel. Topics Signal Process., vol. 8, no. 2, pp. 184–194, 2014.
[4] M. N. Murthi and B. D. Rao, “All-pole modeling of speech based on the minimum variance distortionless response spectrum,” IEEE Trans. Speech Audio Process., vol. 8, no. 3, pp. 221–239, 2000.
[5] T. Drugman and Y. Stylianou, “Fast inter-harmonic reconstruction for spectral envelope estimation in high-pitched voices,” IEEE Signal Process. Lett., vol. 21, no. 11, pp. 1418–1422, 2014.
[6] A. El-Jaroudi and J. Makhoul, “Discrete all-pole modeling,” IEEE Trans. Signal Process., vol. 39, no. 2, pp. 411–423, 1991.
[7] L. A. Ekman, W. B. Kleijn, and M. N. Murthi, “Regularized linear prediction of speech,” vol. 16, no. 1, pp. 65–73, 2008.
[8] P. Alku, J. Pohjalainen, M. Vainio, A. M. Laukkanen, and B. H. Story, “Formant frequency estimation of high-pitched vowels using weighted linear prediction,” J. Acoust. Soc. Am., vol. 134, no. 2, pp. 1295–1313, Aug. 2013.