Learning Waveform-Based Acoustic Models using Deep Variational Convolutional Neural Networks

Dino Oglic; Zoran Cvetkovic; Peter Sollich

arXiv:1906.09526 · stat.ML · August 17, 2021

TL;DR

This paper introduces a stochastic deep convolutional neural network for waveform-based acoustic modeling in speech recognition, leveraging variational inference and adaptive filters to improve robustness and performance over existing methods.

Contribution

It proposes a novel waveform-based acoustic model using deep variational CNNs with adaptive parametric filters and an effective approximation for variational inference, enhancing robustness.

Findings

01. Outperforms comparable waveform-based baselines.

02. Achieves better results than standard FBANK feature-based models.

03. Demonstrates robustness improvements in speech recognition.

Tables

Table 1. Average phoneme error rates (standard deviations in brackets) obtained using variational parznets 1d and Gaussian mean field (variational) inference on the timit dataset.

	non-adaptive mel-filters; kl approximation: hermite-gauss quad.	adaptive filters; kl approximation: hermite-gauss quad.	adaptive filters; kl approximation: molchanov et al. [38]	adaptive filters; kl approximation: hermite-gauss quad.	adaptive filters; kl approximation: hermite-gauss quad.	adaptive filters; kl approximation: mcmc [15]
sample	vi – log-scale uniform	vi – log-scale uniform	vi – log-scale uniform	vi – log-scale uniform	vi – scale mixture	vi – scale mixture
	squared epanechnikov	squared epanechnikov	squared epanechnikov	gauss	squared epanechnikov	squared epanechnikov
dev	$15.02$ ( $\pm 0.26$ )	$14.95$ ( $\pm 0.14$ )	14.77 ( $\pm 0.15$ )	$14.83$ ( $\pm 0.13$ )	$15.64$ ( $\pm 0.11$ )	$15.58$ ( $\pm 0.20$ )
test	$16.95$ ( $\pm 0.25$ )	16.52 ( $\pm 0.22$ )	$16.63$ ( $\pm 0.23$ )	$16.60$ ( $\pm 0.22$ )	$17.41$ ( $\pm 0.17$ )	$17.56$ ( $\pm 0.16$ )

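Table 2. Word error rates on aurora4 obtained using different test samples.

	vi – log-scale uniform	vi – log-scale uniform	vi – log-scale uniform	vi – log-scale uniform	vi – scale mixture	vi – scale mixture
	8 x cnn	8 x cnn	8 x cnn	10 x cnn	8 x cnn	8 x cnn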
adaptive filters		$✓$	$✓$	$✓$	$✓$	$✓$
kl: hermite-gauss	$✓$	$✓$		$✓$	$✓$
kl: molchanov et al.			$✓$
kl: mcmc						$✓$
a. same microphone
clean (a)	$3.05$	$2.88$	$2.84$	$2.78$	$3.12$	2.71
b. same microphone
car	$3.29$	$3.34$	$3.14$	3.10	$3.29$	$3.25$
babble	$4.63$	$4.33$	$4.84$	4.26	$4.54$	$4.84$
restaurant	$6.46$	6.00	$6.18$	$6.54$	$6.65$	$6.37$
street	$5.87$	$5.87$	$5.88$	5.70	$6.22$	$6.16$
airport	$4.76$	$4.45$	$4.58$	4.43	$4.78$	$4.61$
train	$6.41$	$6.33$	6.30	$6.35$	6.30	$6.35$
average (b)	$5.24$	5.05	$5.15$	$5.06$	$5.30$	$5.26$
c. different microphones
clean (c)	$5.90$	$5.59$	$6.02$	5.27	$6.09$	$5.96$
d. different microphones
car	$9.79$	$9.30$	$9.36$	9.10	$9.84$	$10.14$
babble	$15.84$	$15.41$	$16.01$	14.78	$16.07$	$16.16$
restaurant	$20.08$	$20.77$	$21.39$	19.56	$21.15$	$21.24$
street	$17.31$	16.80	$17.71$	$17.28$	$17.65$	$18.61$
airport	$14.70$	$13.88$	$14.65$	13.30	$14.70$	$14.94$
train	$17.43$	16.99	$17.49$	$17.07$	$17.64$	$17.90$
average (d)	$15.86$	$15.53$	$16.10$	15.18	$16.18$	$16.50$
average (all)	$9.68$	$9.42$	$9.74$	9.25	$9.86$	$9.95$

Table 3. Comparison of phoneme error rates obtained in our experiments on timit to the ones reported for relevant feedforward nets.

method	avg	min
a. raw speech baselines (optimized filters)
variational parznets	16.5	16.2
deterministic parznets	$17.7$	$17.5$
sincnet [23, 69]	$17.5$	$17.2$
sinc²net [58]	$-$	$16.9$
end-to-end cnn [59]	$-$	$18.0$
raw speech cnn [23]	$18.3$	$18.1$
b. standard features (non-adaptive filters)
fmllr + mlp	$16.9$	$16.7$
mfcc + mlp [70]	$18.1$	$17.8$
multi-res dss + cnn & mlp [71]	$-$	$17.4$

Table 4. Word error rates obtained on aurora4 using multi-condition training and input/context frames of 200 ms (a: clean speech with same microphone, b: noisy speech with same microphone, c: clean speech with different microphones, d: noisy speech with different microphones).

method	a	b	c	d	avg
a. raw speech & var. baselines (optimized filters)
dnn alignments
var. parznets (10 x cnn1d)	$2.22$	$4.50$	$4.71$	$14.72$	8.73
det. parznets (10 x cnn1d)	$2.35$	$4.73$	$4.86$	$15.48$	$9.17$
var. parznets (8 x cnn1d)	$2.15$	$4.50$	$5.28$	$15.07$	$8.92$
det. parznets (8 x cnn1d)	$2.24$	$4.61$	$5.75$	$15.48$	$9.18$
gmm alignments
var. parznets (10 x cnn1d)	$2.78$	$5.06$	$5.27$	$15.18$	$9.25$
var. parznets (8 x cnn1d)	$2.88$	$5.05$	$5.59$	$15.53$	$9.42$
sincnet [69]	$3.42$	$6.33$	$6.13$	$16.99$	$10.68$
cvae feats + mlp [55, 56]	$3.50$	$7.40$	$6.90$	$17.10$	$11.20$
b. standard features (non-adaptive filters)
fbank + vd10 x cnn2d [21]	$4.13$	$6.62$	$5.92$	$14.53$	$9.78$
fbank + vd8 x cnn2d [21]	$3.72$	$6.57$	$5.83$	$14.79$	$9.84$
fmllr + mlp	$3.34$	$6.27$	$5.74$	$16.04$	$10.21$
mfcc + mlp	$4.28$	$7.44$	$8.73$	$18.71$	$12.14$
dss (utt. norm.) + junct. net	$3.05$	$5.82$	$6.11$	$15.94$	$9.98$
dss (w/o norm.) + junct. net	$4.09$	$6.35$	$8.24$	$19.07$	$11.78$

Table 5. The word error rates obtained on dev and eval sets of ami-ihm with various input features and neural architectures. We did not use any data augmentation techniques or i-vectors in the experiments. Following the original Kaldi recipe, a 3-gram language model built from the ami and fisher data was adopted. Some of the related baselines relied on a contextually more expressive 4-gram language model, and were compiled solely using the ami data. The column size refers to an approximate number of differentiable parameters in the respective neural architectures.

architecture	dev	eval	lm	size
a. raw speech baselines (adaptive filters)
var. parznets (10 x cnn1d)	24.7	25.7	3-gram	$17.4$ M
det. parznets (10 x cnn1d)	$25.0$	$26.4$	3-gram	$8.7$ M
var. parznets (8 x cnn1d)	$25.1$	$26.4$	3-gram	$19.0$ M
det. parznets (8 x cnn1d)	$25.9$	$27.7$	3-gram	$9.5$ M
sincnet [77]	$28.0$	$30.2$	3-gram	$9.0$ M
multi-span-dnn [22]	$27.2$	$29.3$	4-gram	$4.7$ M
b. standard features (non-adaptive filters)
fbank-mlp [22]	$28.3$	$31.1$	4-gram	$3.0$ M
fmllr-mlp	$26.0$	$27.1$	3-gram	$8.5$ M
tdnn [78]	$25.3$	$26.0$	3-gram	$7.7$ M

Equations

$$k_{\gamma}(t)=\max\left\{0,\,1-\gamma t^{2}\right\}^{2},$$

$$\phi_{\eta,\gamma}(t)=\cos\left(2\pi\eta t\right)\cdot k_{\gamma}(t).$$

$$\left\|\Phi(f)-\Phi(D_{\tau}f)\right\|_{\mathcal{H}}\leq L\left\|\mathbb{I}-D_{\tau}\right\|_{\infty}\left\|f\right\| := L\left(\sup_{t\in\Omega}\left\|\nabla\tau(t)\right\|+\sup_{t\in\Omega}\left\|\nabla\nabla\tau(t)\right\|\right)\left\|f\right\|,$$

$$\Phi(x)=\left(\rho_{l}\circ\rho_{l-1}\circ\dots\circ\rho_{1}\right)(x),$$

$$\left\|Wz+b-Wz'-b\right\|_{2}\leq L\left\|z-z'\right\|_{2}.$$

$$\max_{1\leq j\leq k}\sigma_{i,j}(z)-\max_{1\leq j\leq k}\sigma_{i,j}(z')\leq\left\|\sigma_{i}(z)-\sigma_{i}(z')\right\|_{2}.$$

$$\left|\iota(z)-\iota(z')\right|=\max_{1\leq j\leq k}\sigma_{i,j}(z)-\max_{1\leq j\leq k}\sigma_{i,j}(z')\leq\sigma_{i,j_{0}}(z)-\sigma_{i,j_{0}}(z')\leq\max_{1\leq j\leq k}\left(\sigma_{i,j}(z)-\sigma_{i,j}(z')\right)\leq\left\|\sigma_{i}(z)-\sigma_{i}(z')\right\|_{\infty}\leq\left\|\sigma_{i}(z)-\sigma_{i}(z')\right\|_{2}.$$

$$\log p\left(\Delta\mid X_{n},Y_{n}\right)\propto\log p_{r}\left(\Delta\mid\eta\right)+\sum_{i=1}^{n}\log p\left(y_{i}\mid x_{i},\Delta\right).$$

$$\min_{q\in Q}\ \mathrm{KL}\left(q\mid\mid p_{r}\right)-\sum_{i=1}^{n}\mathbb{E}_{\Delta\sim q(\Delta\mid\mu,\sigma)}\left[\log p\left(y_{i}\mid x_{i},\Delta\right)\right],$$

$$L_{n}(q)=\sum_{i=1}^{n}\mathbb{E}_{\Delta\sim q(\Delta\mid\mu,\sigma)}\left[\log p\left(y_{i}\mid x_{i},\Delta\right)\right]$$

$$L_{n}(q)\approx\tilde{L}_{m}(q)=\frac{n}{m}\sum_{i=1}^{m}\log p\left(y_{i}\mid x_{i},\Delta\right),$$

$$J=\int_{-\infty}^{\infty}h(u)\exp\left(-u^{2}\right)\mathrm{d}u,$$

$$E_{s}(h)=\int_{-\infty}^{\infty}h(u)\exp\left(-u^{2}\right)\mathrm{d}u-\sum_{i=1}^{s}w_{i}h(u_{i})=\frac{s!\cdot\sqrt{\pi}}{2^{s}\cdot(2s)!}\,h^{(2s)}(\hat{u}),$$

$$\mathrm{KL}\left(q\mid\mid p_{r,\mathrm{lsu}}\right)\approx-\frac{1}{2}\log\alpha+\frac{1}{\sqrt{\pi}}\sum_{i=1}^{s}w_{i}\log\left|v_{i}\right|+\mathrm{const.},$$

$$\mathrm{KL}\left(q\mid\mid p_{r,\mathrm{lsu}}\right)=\mathbb{E}_{\mathcal{N}(\epsilon\mid 1,\alpha)}\left[\log\left|\epsilon\right|\right]-\frac{1}{2}\log\alpha+\mathrm{const.}$$

$$\begin{aligned}\mathbb{E}_{\mathcal{N}(\epsilon\mid 1,\alpha)}\left[\log\left|\epsilon\right|\right]&=\frac{1}{\sqrt{2\pi\alpha}}\int\exp\left(-\frac{(\epsilon-1)^{2}}{2\alpha}\right)\log\left|\epsilon\right|\,\mathrm{d}\epsilon\\&=\frac{1}{\sqrt{\pi}}\int\log\left|\sqrt{2\alpha}\,t+1\right|\exp\left(-t^{2}\right)\mathrm{d}t.\end{aligned}$$

$$p_{r,\mathrm{sm}}\left(\Delta_{i}\mid\xi,\eta_{1},\eta_{2},\lambda\right)=\lambda\cdot\mathcal{N}\left(\Delta_{i}\mid\xi,\eta_{1}^{2}\right)+(1-\lambda)\cdot\mathcal{N}\left(\Delta_{i}\mid\xi,\eta_{2}^{2}\right),$$

$$\mathrm{KL}\left(q\mid\mid p_{r,\mathrm{sm}}\right)\approx-\log\sqrt{2\pi\alpha\mu^{2}}-\frac{1}{\sqrt{\pi}}\sum_{i=1}^{s}w_{i}\log p_{r,\mathrm{sm}}(v_{i})-\frac{1}{2},$$

$$\mathrm{KL}\left(q\mid\mid p_{r,\mathrm{sm}}\right)=\int q(u)\log q(u)\,\mathrm{d}u-\int q(u)\log p_{r,\mathrm{sm}}(u)\,\mathrm{d}u=-H(q)-\mathbb{E}_{q}\left[\log p_{r,\mathrm{sm}}(u)\right],$$

$$q(u)=\frac{1}{\sqrt{2\pi\alpha\mu^{2}}}\exp\left(-\frac{(u-\mu)^{2}}{2\alpha\mu^{2}}\right).$$

$$H(q)=\log\sqrt{2\pi\alpha\mu^{2}}+\frac{1}{2}.$$

$$\mathbb{E}_{q}\left[\log p_{r,\mathrm{sm}}(u)\right]=\frac{1}{\sqrt{2\pi\alpha\mu^{2}}}\int\exp\left(-\frac{(u-\mu)^{2}}{2\alpha\mu^{2}}\right)\log p_{r,\mathrm{sm}}(u)\,\mathrm{d}u=\frac{1}{\sqrt{\pi}}\int\log p_{r,\mathrm{sm}}\left(\sqrt{2\alpha\mu^{2}}\,t+\mu\right)\exp\left(-t^{2}\right)\mathrm{d}t.$$

$$\log p\to\log\left((1-2\kappa)p+\kappa\right),$$
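The quadrature-based approximations of the two Kullback–Leibler terms listed above can be checked numerically with a small sketch. The code below is only an illustration under stated assumptions: it uses a 5-point Gauss–Hermite rule (the number of nodes $s$ is our choice), maps each node $t$ to $v=\sqrt{2\,\mathrm{var}}\,t+\mathrm{mean}$ following the change of variables in the expectations above, and drops the additive constant in the log-scale uniform case. All function names are hypothetical.

```python
import math

# 5-point Gauss-Hermite rule for integrals of h(u) * exp(-u^2)
# (standard tabulated nodes and weights; the weights sum to sqrt(pi)).
GH_NODES = [-2.020182870456086, -0.958572464613819, 0.0,
            0.958572464613819, 2.020182870456086]
GH_WEIGHTS = [0.019953242059046, 0.393619323152241, 0.945308720482942,
              0.393619323152241, 0.019953242059046]


def gauss_expectation(h, mean, var):
    """Approximate E_{N(u | mean, var)}[h(u)] with u = sqrt(2 var) t + mean."""
    scale = math.sqrt(2.0 * var)
    total = sum(w * h(scale * t + mean) for t, w in zip(GH_NODES, GH_WEIGHTS))
    return total / math.sqrt(math.pi)


def kl_log_scale_uniform(alpha):
    """KL(q || p_lsu) up to an additive constant: E[log|eps|] - 0.5 log(alpha)."""
    # Assumes no mapped node lands exactly on eps = 0, where log|eps| diverges.
    expectation = gauss_expectation(lambda e: math.log(abs(e)), 1.0, alpha)
    return expectation - 0.5 * math.log(alpha)


def log_scale_mixture(u, xi, eta1, eta2, lam):
    """Log-density of the two-component Gaussian scale mixture prior."""
    def normal_pdf(v, sd):
        return math.exp(-0.5 * ((v - xi) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
    return math.log(lam * normal_pdf(u, eta1) + (1.0 - lam) * normal_pdf(u, eta2))


def kl_scale_mixture(mu, alpha, xi, eta1, eta2, lam):
    """KL(q || p_sm) = -H(q) - E_q[log p_sm(u)] for q = N(mu, alpha * mu^2)."""
    var = alpha * mu * mu
    entropy = 0.5 * math.log(2.0 * math.pi * var) + 0.5
    cross = gauss_expectation(
        lambda u: log_scale_mixture(u, xi, eta1, eta2, lam), mu, var)
    return -entropy - cross
```

When both mixture components share the same scale, the prior is a single Gaussian, its log-density is quadratic, and the quadrature reproduces the closed-form Gaussian KL divergence up to rounding. This gives a convenient sanity check.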

Code & Models

Repositories

https://bitbucket.org/doglic/asr
mxnet (official)

Full text

Learning Waveform-Based Acoustic Models using Deep Variational Convolutional Neural Networks

Dino Oglic, Zoran Cvetkovic, and Peter Sollich

D. Oglic and Z. Cvetkovic are with the Department of Engineering, King’s College London. Correspondence to: [email protected]. P. Sollich is with the Department of Mathematics, King’s College London, and the Institute for Theoretical Physics, University of Göttingen.

Abstract

We investigate the potential of stochastic neural networks for learning effective waveform-based acoustic models. The waveform-based setting, inherent to fully end-to-end speech recognition systems, is motivated by several comparative studies of automatic and human speech recognition that associate standard non-adaptive feature extraction techniques with information loss, which can adversely affect robustness. Stochastic neural networks, on the other hand, are a class of models capable of incorporating rich regularization mechanisms into the learning process. We consider a deep convolutional neural network that first decomposes speech into frequency sub-bands via an adaptive parametric convolutional block where filters are specified by cosine modulations of compactly supported windows. The network then employs standard non-parametric 1d convolutions to extract relevant spectro-temporal patterns while gradually compressing the structured high dimensional representation generated by the parametric block. We rely on a probabilistic parametrization of the proposed neural architecture and learn the model using stochastic variational inference. This requires evaluation of an analytically intractable integral defining the Kullback–Leibler divergence term responsible for regularization, for which we propose an effective approximation based on the Gauss–Hermite quadrature. Our empirical results demonstrate a superior performance of the proposed approach over comparable waveform-based baselines and indicate that it could lead to robustness. Moreover, the approach outperforms a recently proposed deep convolutional neural network for learning of robust acoustic models with standard fbank features.

Index Terms:

Convolutional neural networks, parametric filters, variational inference, waveform-based speech recognition.

I Introduction

Automatic speech recognition systems typically operate in low-dimensional feature spaces designed to achieve invariances inherent to speech production and human speech recognition [1, 2, 3]. Log Mel-filter bank values (fbank) and their de-correlated variant known as Mel-frequency cepstral coefficients (mfcc) are two of the most frequently used feature extraction techniques of this kind [4, 5]. Several comparative studies of automatic and human speech recognition [6, 7, 8] suggest that the information loss inherent to such feature extraction techniques can adversely affect robustness to standard environmental distortions arising from additive and channel (linear filtering) noise [9, 10]. Motivated by this, we propose an effective and principled approach for learning robust acoustic models in the waveform domain. A difficulty in the waveform setting is the sheer size of the training data required for learning effective waveform-based models. More specifically, the requirement for more than 2,000 hours of speech in [11, 12] translates into weeks of training on a typical device with gpu support. Our aim is to tackle this problem by incorporating relevant inductive bias into the learning process, thereby allowing effective waveform-based acoustic models to be learned from moderately sized datasets. There are two components in our approach, one dealing with the design of neural architectures and the other with learning of the corresponding parameters.

Section II is concerned with the design of neural architecture, which should perform automatic feature extraction by avoiding fast compression schemes associated with information loss when operating with standard non-adaptive filterbank features [6, 7, 8]. We design the neural network as a Lipschitz continuous operator that maps speech waveform frames into a feature space in such a way that small perturbations in the inputs caused by local translations and diffeomorphisms result in relatively small changes in the pre-softmax network outputs. As we operate in the waveform domain, the first layer of our convolutional network extracts information relevant for discrimination between phonetic units by decomposing a speech frame into frequency sub-bands using a set of parametric band-pass filters. The filters are defined by cosine modulations of compactly supported windows and allow for embedding of waveform signals into a structured high-dimensional space where we hypothesize that phonetic units will be easier to separate. The network then employs standard 1d convolutional layers with non-parametric filters for extraction of relevant spectro-temporal patterns while gradually compressing the structured representation generated by the sub-band decomposition. The outputs of the last such convolutional block are passed to a multi-layer perceptron (mlp) with a softmax output.

The learning component of our approach is described in Section III. We propose to learn a probabilistic parametrization of our architecture using variational inference. The motivation for this comes from the fact that for robustness one needs to be able to select the operator mapping with a good Lipschitz constant. The role of probabilistic parametrization and variational inference is to regularize the training process, thus allowing us to learn a robust feature representation of speech signals. This is different from a typical acoustic model, which employs an artificial neural network with real-valued parameters. Such a deterministic parametrization of the network fails to capture the uncertainty of individual parameters and their importance for the learning task. Bayesian machine learning provides a principled framework for modeling uncertainty by finding plausible models that could explain the observed data [13, 14]. In particular, a (deterministic) neural network with fixed parameter values models the conditional probability of a sub-phonetic unit given a speech frame. In stochastic neural networks one additionally assumes that the parameters follow some prior distribution. The latter coupled with the aforementioned likelihood gives rise to a posterior distribution of parameter values conditioned on the observed data. Such posteriors are typically defined via analytically intractable integrals that can be approximated using scalable inference techniques such as stochastic variational inference [15, 16, 17]. In particular, the main idea is to approximate intractable posteriors by optimizing over parameters of an a priori selected family of variational distributions. The optimization objective in variational inference consists of two terms: i) the expected negative log-likelihood of the model, where the expectation is taken with respect to the variational distribution, and ii) the Kullback–Leibler divergence that performs regularization. 
The expectation in the first term is approximated by sampling the variational distribution, which is typically given by a Gaussian mean field. In this way, the variational formulation injects randomness into the forward pass that computes the loss associated with a particular mini-batch. As a result, stochastic neural networks can capture parameter uncertainty and are less sensitive to perturbations in parameter values, as well as less susceptible to over-fitting [15, 17]. A further regularization effect, incorporated via the Kullback–Leibler divergence, is specified by an analytically intractable integral. For this we propose an effective approximation based on the Gauss–Hermite quadrature. Variational inference has been used previously in speech recognition, albeit in a different context, to maintain the balance between a dataset size and model complexity [18, 19]. In addition to this, a high correlation between the uncertainty in individual parameters and their importance for speech recognition has been observed in stochastic recurrent nets [20, 17]. Previous work, however, does not operate in the waveform domain, focuses on recurrent nets and considers variational inference separately from the properties encoded into the architecture (i.e. Lipschitz continuity in our case).
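The sampling step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian mean field with independent coordinates parametrized by hypothetical vectors `mu` and `sigma`, uses a toy logistic likelihood as a stand-in for the acoustic model, and draws a single parameter sample per mini-batch. The log-likelihood over a mini-batch of size $m$ is rescaled by $n/m$ to estimate the sum over all $n$ training examples.

```python
import math
import random


def sample_mean_field(mu, sigma, rng):
    """Draw Delta ~ q(Delta | mu, sigma) with independent Gaussian coordinates."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def log_lik_logistic(y, x, delta):
    """Toy log p(y | x, Delta) for y in {0, 1}; a stand-in for the acoustic model."""
    a = sum(d * v for d, v in zip(delta, x))
    z = a if y == 1 else -a
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def minibatch_log_lik(batch, n, mu, sigma, rng):
    """One-sample estimate (n / m) * sum_i log p(y_i | x_i, Delta), Delta ~ q."""
    delta = sample_mean_field(mu, sigma, rng)  # randomness enters the forward pass here
    m = len(batch)
    return (n / m) * sum(log_lik_logistic(y, x, delta) for x, y in batch)
```

Each call draws a fresh parameter sample, so repeated forward passes over the same mini-batch yield different losses. Setting `sigma` to zero recovers the deterministic network.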

In Section IV, we focus on the relationship with prior work on speech recognition in the waveform domain. We then evaluate the proposed approach empirically on three benchmark datasets for automatic speech recognition: timit, aurora4, and ami-ihm. A summary of our empirical results is provided in Section V. The ablation study (evaluating the effectiveness of individual components in our approach) demonstrates that acoustic models based on modulation filter learning can be more effective, in a statistically significant way, than the ones with non-adaptive filters. Moreover, the experiments indicate that the proposed approximation scheme based on the Gauss–Hermite quadrature provides a general (with respect to the choice of prior function) and effective means for approximating the Kullback–Leibler divergence term. The experiments on the timit dataset demonstrate that the approach does not over-fit despite using a rather large network on what in speech recognition is considered to be a small dataset. Moreover, our results on aurora4 show that the approach is capable of learning a noise robust model, significantly outperforming the state-of-the-art baselines for waveform-based speech recognition on this dataset. It is also promising that on the same dataset the approach outperforms a recently proposed deep convolutional network for learning of robust acoustic models with standard fbank features [21]. The experiments on ami (conversational speech, without i-vectors or data augmentation) show that the approach outperforms recently proposed architectures for raw speech (see [22] and [23]) and performs on par with a state-of-the-art fbank/mfcc based deep time-delay neural network (tdnn) model [24]. Thus, our empirical contributions provide comprehensive evidence for the effectiveness of variational neural networks operating directly in the waveform domain.

II Parznets — Deep Convolutional Neural Networks for Waveform-based Speech Recognition

This section describes an artificial neural network for learning acoustic models in the waveform domain. We first provide a brief overview of the relevant building blocks of the architecture (Section II-A) and then introduce a parametric convolutional layer responsible for decomposition of speech signals into frequency sub-bands (Section II-B). The section concludes with a theoretical analysis demonstrating that the proposed neural architecture defines a Lipschitz continuous operator in the waveform domain (Section II-C).

II-A Overview of the Neural Architecture

We would like to design an architecture capable of embedding redundancies into the representation, thereby avoiding significant overlaps between positioning of different phonetic units while allowing for a fair amount of additive noise and distortion at inputs. Motivated by this, we extract information relevant for discrimination between phonetic units via a parametric Parzen convolutional block (Section II-B) that decomposes a waveform frame into frequency sub-bands, thereby embedding the signal into a high-dimensional space of high-resolution spectro-temporal patterns (illustrated in Fig. 1, parznets 1d). A notable difference compared to non-adaptive feature extraction operators (fbank and mfcc) is the use of a relu activation function instead of the modulus (squared) non-linearity. Mallat [25] has demonstrated that this change in activation function does not affect the theoretical properties of such operators. Moreover, it has been established recently that neural networks with relu activations realize piecewise linear functions and we therefore use that non-linearity throughout the network [26]. The main motivation behind this choice is to avoid further confounding effects between signal and noise that might otherwise arise from additional sources of non-linearity in the automatic feature extraction process (it is well known, for example, that the effects of channel noise can be amplified by non-linearities). To extract relevant patterns from such a sub-band decomposition/representation, we rely on standard non-parametric convolutional filters and pass the Parzen sub-bands to double convolutional blocks with $5$ sample long filters (see conv-conv in Fig. 1). The gradual compression of the spectro-temporal representation is achieved by applying the max pooling operator with size $3$ (after each pair of non-parametric convolutional blocks). 
Previous work [27] has demonstrated that a composition of convolution with max pooling tends to provide approximate local time-translation invariance. In our preliminary experiments, we investigated the effectiveness of $\max$ and (weighted) $\ell_{p}$ average pooling operators, and observed that the former works the best in combination with relu activations. The features extracted by the last convolutional block are passed to an mlp block with three hidden layers (i.e., fully connected layers denoted by fc in Fig. 1), followed by a softmax output block.
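The conv-conv block with max pooling described above can be illustrated with a small sketch. The details below are assumptions rather than the paper's configuration: convolutions use 'valid' padding, pooling windows of size 3 do not overlap, and the filter weights are made up for the example.

```python
def conv1d_valid(x, w):
    """'Valid' 1D cross-correlation: out[i] = sum_k w[k] * x[i + k]."""
    n = len(x) - len(w) + 1
    return [sum(w[k] * x[i + k] for k in range(len(w))) for i in range(n)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def max_pool(x, size=3):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def conv_conv_pool(x, w1, w2, size=3):
    """Two conv + relu stages followed by max pooling (one conv-conv block)."""
    h = relu(conv1d_valid(x, w1))
    h = relu(conv1d_valid(h, w2))
    return max_pool(h, size)


# Effect of a one-sample shift on a single conv + relu + pool stage.
w = [1.0, 2.0, 3.0, 2.0, 1.0]                 # hypothetical 5-sample filter
x = [0.0] * 10
x[2] = 1.0                                    # impulse at sample 2
x_shifted = [0.0] * 10
x_shifted[3] = 1.0                            # same impulse, one sample later
pooled = max_pool(relu(conv1d_valid(x, w)))
pooled_shifted = max_pool(relu(conv1d_valid(x_shifted, w)))
```

In this toy case the pooled outputs are `[3.0, 0.0]` and `[3.0, 1.0]`. The dominant response is unchanged by the shift and only a small residual leaks into the neighbouring window, which is the sense in which the invariance is approximate and local.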

II-B Parzen Block for Sub-band Decomposition of Speech Signals

It has been demonstrated recently that feature extraction operators that combine band-pass filtering with the modulus (square) non-linearity and (weighted) local averaging are approximately locally translation invariant and Lipschitz continuous [28]. A potential shortcoming of these operators is the fact that filter parameters are selected a priori without relying on data. As a result, the hypothesis space is selected beforehand and does not necessarily provide an ideal inductive bias for all learning tasks. Moreover, the power spectral averaging that is characteristic of these operators is typically performed over speech segments of $25$ or $32$ ms [28, 29], which could be compressing the relevant information too fast into the resulting features. As a result of such compression, the feature extraction operator might be discarding the information relevant for robustness. Motivated by this, we have designed the Parzen convolutional block to tackle these shortcomings. In particular, the block does not rely on a priori selected filters but learns these via parametric convolutions that have a strongly encoded inductive bias. Moreover, the adaptive Parzen convolutional block embeds a waveform frame into a structured high dimensional space rather than compressing it into a small number of features. The latter is an important difference compared to mfcc and fbank coefficients, which do not focus on embedding redundancies into the representation. As explained above, the Parzen sub-band decomposition is followed by a gradual compression of the representation using a combination of convolutional and max pooling operators.

In speech recognition, band-pass filtering of signals is traditionally performed by (weighted) averaging of power spectra [see 5, 30] computed over speech frames of fixed duration. Alternatively, the signal can be convolved with a filter directly in the time domain. To that end, we consider a family of differentiable band-pass filters based on cosine modulations of compactly supported Parzen windows [31]. In particular, we employ the squared Epanechnikov window function given by

$$k_{\gamma}(t)=\max\left\{0,\,1-\gamma t^{2}\right\}^{2},\tag{1}$$

where $\gamma$ is a parameter controlling the window width, and implicitly its frequency bandwidth. The filter can be made more frequency selective by increasing its exponent (illustrated above with the square operator), which is a consequence of increasing its order of differentiability. To allow for flexible placement of the center frequency we rely on cosine modulation. Thus, Parzen filters are defined with only two differentiable parameters, $\eta$ controlling the modulation frequency and $\gamma$ controlling the filter bandwidth:

$$\phi_{\eta,\gamma}(t)=\cos\left(2\pi\eta t\right)\cdot k_{\gamma}(t).\tag{2}$$

As illustrated in Fig. 1 (the leftmost panel), for each filter configuration $\left\{{\left({\eta_{i},\gamma_{i}}\right)}\right\}_{i=1}^{B}$ , we use Eq. (2) to generate a one-dimensional convolutional filter with maximum length given by the number of samples in $25$ ms of speech; filters with shorter support are symmetrically padded with zeros. The outputs of parametric convolutions are concatenated into a high dimensional spectro-temporal decomposition of a signal and then passed to a max pooling operator, followed by layer normalization [32]. As all of the operations in this parametric block are differentiable, it is possible to construct an auto-differentiation graph that seamlessly provides gradients with respect to parameters of Parzen filters. In comparison to wavelet filters [33], the Parzen convolutional block offers additional flexibility by allowing independent control over bandwidth and modulation frequency. Moreover, the block optimizes for the positioning of the two parameters while having the parametric form of the filter factored into the optimization. This can be seen as a more flexible approach compared also to the two-step procedure employed by [23], where filter cut-off frequencies are optimized with respect to a fixed-length rectangular window, and then a Hamming window is superimposed to suppress the ripple effects.
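As a rough illustration of how a single Parzen filter could be sampled from its two parameters, consider the sketch below. It is not the authors' code and makes several assumptions that do not come from the text: a sampling rate of 16 kHz, $\eta$ measured in Hz, $t$ in seconds, $\gamma$ in $1/\mathrm{s}^{2}$, and an odd number of taps centred at $t=0$. The function names are hypothetical.

```python
import math


def squared_epanechnikov(t, gamma):
    """Window max(0, 1 - gamma * t^2)^2, supported on |t| <= 1 / sqrt(gamma)."""
    return max(0.0, 1.0 - gamma * t * t) ** 2


def parzen_filter(eta, gamma, sample_rate=16000, max_ms=25.0):
    """Sample cos(2 * pi * eta * t) * k_gamma(t) on a grid centred at t = 0.

    The grid spans at most max_ms milliseconds of speech. Taps outside the
    window support evaluate to zero, which plays the role of symmetric
    zero padding for filters with shorter support.
    """
    n_taps = int(sample_rate * max_ms / 1000.0)
    if n_taps % 2 == 0:
        n_taps -= 1                      # odd length so that t = 0 is a tap
    half = n_taps // 2
    taps = []
    for i in range(-half, half + 1):
        t = i / sample_rate
        taps.append(math.cos(2.0 * math.pi * eta * t) * squared_epanechnikov(t, gamma))
    return taps


# Example: modulation at 1 kHz with a window half-width of about 3.2 ms.
example_filter = parzen_filter(eta=1000.0, gamma=1e5)
```

Because both factors are differentiable in $\eta$ and $\gamma$ almost everywhere, the same expression written in an auto-differentiation framework would provide gradients with respect to the two filter parameters.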

II-C Lipschitz Continuity of the Operator Mapping

We start with a review of Lipschitz continuity for operator mappings and properties relevant for their robustness. Following this, we demonstrate that the principle for the design of neural architectures outlined in Section II-A and Fig. 1 defines a Lipschitz continuous operator in the waveform domain.

Let $\mathcal{L}\left({\mathbb{R}}\right)$ denote the space of square integrable functions defined on $\mathbb{R}$ and consider a continuous signal $f\in\mathcal{L}\left({\mathbb{R}}\right)$ . An operator $\Phi\colon\mathcal{L}\left({\mathbb{R}}\right)\rightarrow\mathcal{H}$ is a mapping of a signal into a Hilbert space $\mathcal{H}$ . Let $T_{c}f\left({t}\right)=f\left({t-c}\right)$ denote the translation of a signal $f$ by some constant $c\in\mathbb{R}$ . An operator $\Phi$ is called translation invariant if $\Phi\left({T_{c}f}\right)=\Phi\left({f}\right)$ for all $f\in\mathcal{L}\left({\mathbb{R}}\right)$ and $c\in\mathbb{R}$ . The spectrogram of a signal is an operator designed to capture variations in the power spectrum over time. It can provide an approximately locally time-translation invariant representation over durations limited by a window [28]. While the spectrogram of a signal can provide local time-translation invariance, Mallat [29] has demonstrated that it does not necessarily provide stability to the action of a small diffeomorphism (e.g., speed perturbation of an utterance). Let $D_{\tau}\colon\mathcal{L}\left({\mathbb{R}}\right)\rightarrow\mathcal{L}\left({\mathbb{R}}\right)$ be a diffeomorphism of a signal (i.e., invertible function that maps one differentiable manifold to another such that both the function and its inverse are smooth) given by $D_{\tau}f\left({t}\right)=f\left({t-\tau\left({t}\right)}\right)$ , where $\tau\left({t}\right)\in\mathcal{C}^{2}\left({\mathbb{R}}\right)$ is a displacement field and $\mathcal{C}^{2}\left({\mathbb{R}}\right)$ denotes the space of twice continuously differentiable functions over the reals. For example, one can take $\tau\left({t}\right)=\epsilon t$ with $\epsilon\in\mathbb{R}$ and $\epsilon\rightarrow 0$ . To preserve stability relative to a small diffeomorphism of a signal, it is sufficient to ensure that the operator $\Phi$ is Lipschitz continuous [29, 28]. 
A translation invariant operator $\Phi$ is Lipschitz continuous with respect to actions of $\mathcal{C}^{2}$ -diffeomorphisms if for any compact $\Omega\subset\mathbb{R}$ there exists a constant $L$ such that for all signals $f\in\mathcal{L}\left({\mathbb{R}}\right)$ supported on $\Omega$ and all $\tau\in\mathcal{C}^{2}\left({\mathbb{R}}\right)$ it holds that [for more details see, e.g., 29]

[TABLE]

where $\mathbb{I}$ denotes the identity mapping. The Lipschitz continuity of operator $\Phi$ implies invariance to local translations and/or signal warping by a diffeomorphism $\tau\left({t}\right)$ , up to the first and second order deformation terms [29]. Such signal perturbations typically come as a result of variability in speech production and differences between speakers. Another aspect of robust representations is the ability to withstand a fair amount of additive and channel/linear noise. It is easy to show (e.g., using the convolution theorem) that such a perturbation of a clean speech signal amounts to a linear transformation of its representation in the frequency domain. Thus, an operator that is Lipschitz continuous over the sub-band decomposition of a signal has the potential to work effectively on noisy speech. In particular, a noise corrupted signal is a linear transformation of the clean signal in the frequency domain and will be contained within a ball of constant radius centered at the clean signal. An operator that is Lipschitz continuous over the frequency representation of a signal will exhibit small variations over such balls and, thus, it can provide stability relative to additive and channel noise. It is, however, important to point out that the robustness of such an operator quantitatively depends on the value of the Lipschitz constant.
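The convolution-theorem argument above can be checked numerically. The following is only an illustrative sketch: the signal, the channel impulse response, and the noise are random placeholders rather than speech data, and circular convolution is used so that the discrete Fourier identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
clean = rng.standard_normal(n)            # placeholder for a clean signal
channel = np.zeros(n)
channel[:4] = [1.0, 0.5, 0.25, 0.125]     # hypothetical short channel impulse response
noise = 0.1 * rng.standard_normal(n)      # placeholder additive noise

# Time domain: circular convolution with the channel, followed by additive noise.
conv = np.array([sum(channel[k] * clean[(t - k) % n] for k in range(4))
                 for t in range(n)])
noisy = conv + noise

# Frequency domain: the same perturbation scales the clean spectrum
# element-wise by the channel response and adds the noise spectrum.
C, H, N = np.fft.fft(clean), np.fft.fft(channel), np.fft.fft(noise)
noisy_spectrum = H * C + N

max_abs_error = np.max(np.abs(np.fft.fft(noisy) - noisy_spectrum))
```

Under these assumptions `max_abs_error` is at the level of floating-point round-off, which is the sense in which the perturbation acts linearly on the frequency representation.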

The operator defined by our neural network maps a frame of raw speech $x\in\mathbb{R}^{d}$ into a vector of pre-softmax outputs $z\in\mathbb{R}^{s}$ , where $d$ is the number of samples in the input frame and $s$ is the dimension of the pre-softmax representation. Moreover, this is achieved by having an intermediate representation of the signal in the frequency domain via sub-band decomposition performed by the Parzen block. The operator mapping can be expressed as a composition of functions

[TABLE]

where each $\rho_{i}$ is a relu activation function, a linear operator, or a pooling operator. In particular, the building blocks of our architecture are fully connected and convolutional layers, which are both linear operators and can be realized as matrix-vector multiplications [see, e.g., 34]. For a fully connected block with weights $W$ and bias $b$ , the Lipschitz constant $L$ is given by

[TABLE]

Thus, the minimal value of the Lipschitz constant is equal to $L=\sup_{z\in\mathcal{B}}\nicefrac{{\left\|{Wz}\right\|_{2}}}{{\left\|{z}\right\|_{2}}}$ , where $\mathcal{B}$ is a ball of constant radius containing all the layer inputs in its interior. The convolution blocks can also be realized via matrix-vector multiplications using doubly block circulant matrices [34]. Thus, a good (i.e., small) Lipschitz constant can be obtained by keeping the upper bounds on the weights in linear blocks and convolutional filters low, while at the same time optimizing the operator mapping such that the sub-phonetic units are linearly separable.
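As a rough numerical illustration of this bound, the sketch below compares the largest singular value of a weight matrix with empirical ratios of output to input distances. The weights and inputs are random placeholders, and the supremum is taken over all non-zero inputs rather than over a ball $\mathcal{B}$, so the spectral norm serves here as an upper bound on the constant discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))   # placeholder weights of a fully connected block
b = rng.standard_normal(64)          # placeholder bias


def layer(z):
    """Fully connected block applied to a batch of row vectors."""
    return z @ W.T + b


# The largest singular value equals sup ||Wz||_2 / ||z||_2 over all non-zero z.
spectral_norm = np.linalg.norm(W, 2)

# Empirical ratios ||layer(z1) - layer(z2)||_2 / ||z1 - z2||_2; the bias cancels.
z1 = rng.standard_normal((1000, 128))
z2 = rng.standard_normal((1000, 128))
ratios = (np.linalg.norm(layer(z1) - layer(z2), axis=1)
          / np.linalg.norm(z1 - z2, axis=1))
max_ratio = ratios.max()
```

Every sampled ratio stays below `spectral_norm`, which is why constraining the singular values of the linear blocks is one way to control the Lipschitz constant of the whole mapping.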

Gouk et al. [34] have demonstrated that the relu activation function is Lipschitz continuous with constant one. This activation function is also monotonic and, having Lipschitz constant one, non-expansive (i.e., it never increases the distance between two inputs). The same holds for the $\max$ operator used for signal pooling, as demonstrated by the following proposition.

Proposition 1.

The max pooling operator is a Lipschitz continuous function with constant one.

Proof.

The $\max$ pooling operator can be expressed as $\iota(z)=\max_{1\leq j\leq k}\sigma_{i,j}(z)$ , where $\sigma_{i,j}(z)$ is the $j$ -th output of the $i$ -th network layer that takes a vector $z$ as input. We will show that

[TABLE]

We can, without loss of generality, assume that $\iota(z)\geq\iota(z^{\prime})$ . Denote $j_{0}=\operatorname*{arg\,max}_{1\leq j\leq k}\sigma_{i,j}(z)$ . Then,

[TABLE]

∎
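Proposition 1 can also be sanity-checked numerically. The sketch below is not part of the proof: it draws random placeholder activations for two inputs and confirms that the difference of the pooled maxima never exceeds the distance between the activation vectors, in either the maximum norm or the Euclidean norm.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8                                    # pooling width (illustrative choice)
a = rng.standard_normal((10000, k))      # stands in for the layer outputs at z
b = rng.standard_normal((10000, k))      # stands in for the layer outputs at z'

lhs = np.abs(a.max(axis=1) - b.max(axis=1))   # |max pooling(z) - max pooling(z')|
rhs_inf = np.abs(a - b).max(axis=1)           # maximum-norm distance
rhs_l2 = np.linalg.norm(a - b, axis=1)        # Euclidean distance

# Count the pairs that would violate Lipschitz continuity with constant one.
violations = int(np.sum(lhs > rhs_l2 + 1e-12))
```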

As the proposed neural architecture is defined using a composition of Lipschitz continuous functions, the resulting operator mapping is also Lipschitz continuous. To ensure that the training procedure selects a good Lipschitz constant, we propose to use a probabilistic parametrization for our network and learn the corresponding parameters using stochastic variational inference, as described in the next section.

While Lipschitz continuity of neural architectures has already been associated with robust representation learning [e.g., see 34], this is the first work that provides an explanation for possible advantages of the filterbank over sample-based audio processing. In particular, in order to learn an effective (relative to longer time-shifts, additive and channel/linear noise) waveform-based representation of speech signals one can design the neural architecture as a Lipschitz continuous operator in the waveform-domain, with an intermediate representation in the frequency domain that can be realized using a sub-band decomposition of the signal (the Parzen block in our case).

III Learning Parznets using Stochastic Variational Inference

In deterministic neural networks, parameters/weights are real-valued and one performs inference by optimizing a loss function over them. Performing inference in stochastic/probabilistic neural networks, on the other hand, requires a posterior distribution over parameters given data [17]. For a fixed setting of weights, a deterministic neural network with softmax outputs models the conditional probability of a categorical label $y\in\mathcal{Y}$ given an instance $x\in\mathcal{X}$ using an exponential family model [e.g., see 35, 36]. In stochastic networks, it is further assumed that weights have a prior distribution $p_{r}\left({\Delta\mid\eta}\right)$ , where $\Delta$ denotes all the parameters in the network and $\eta$ are prior hyper-parameters. The posterior distribution of neural network parameters conditioned on a set of iid examples $\{(x_{i},y_{i})\}_{i=1}^{n}$ with $X_{n}=\left\{{x_{i}}\right\}_{i=1}^{n}$ and $Y_{n}=\left\{{y_{i}}\right\}_{i=1}^{n}$ is typically given by an analytically intractable integral, with parameter-specific posterior probabilities $p\left({\Delta\mid X_{n},Y_{n}}\right)$ satisfying

[TABLE]

Variational inference [15, 16, 17, 37] is a technique for the approximation of posterior distributions involving analytically intractable integrals. It works by introducing a family of variational probability density functions $q\left({\Delta\mid\mu,\sigma}\right)$ , with $\mu$ and $\sigma$ denoting variational parameters, such that a set of these specifies a family of probability distributions. Typically, the variational family is parametrically much simpler than the posterior distribution over network parameters $p\left({\Delta\mid X_{n},Y_{n}}\right)$ . The main idea is to approximate the posterior $p\left({\Delta\mid X_{n},Y_{n}}\right)$ by optimizing a lower bound on the log-marginal likelihood of the model over the parameters of the variational distribution

[TABLE]

where $\mathcal{Q}$ is a family of variational distributions specified by domains of parameters $\mu$ and $\sigma$ . The Gaussian mean field approximation assumes that the variational distribution is the product of univariate Gaussian distributions, i.e. $q\left({\Delta\mid\mu,\sigma}\right)=\prod_{i=1}^{p}\ \mathcal{N}\left({\Delta_{i}\mid\mu_{i},\sigma_{i}^{2}}\right)$ , where $p$ is the total number of parameters in the model, $\Delta_{i}$ is the $i$ -th component of the parameter vector $\Delta$ , and $\mathcal{N}\left({\Delta_{i}\mid\mu_{i},\sigma_{i}^{2}}\right)$ is a univariate Gaussian distribution of $\Delta_{i}$ with mean $\mu_{i}$ and variance $\sigma_{i}^{2}$ .

The expected log-likelihood of the model

[TABLE]

is analytically intractable and an evaluation of this expectation is required for the forward-pass when computing the loss function for a setting of the variational parameters $\mu$ and $\sigma$ . Stochastic variational inference approximates this term in the forward-pass by sampling the variational distribution [37]:

[TABLE]

with $\Delta_{j}=\mu_{j}+\epsilon_{j}\sigma_{j}$ being a sample from $\mathcal{N}\left({\Delta_{j}\mid\mu_{j},\sigma_{j}^{2}}\right)$ given by $\epsilon_{j}\sim\mathcal{N}\left({\epsilon_{j}\mid 0,1}\right)$ ( $1\leq j\leq p$ ), and where $\left\{{\left({x_{i},y_{i}}\right)}\right\}_{i=1}^{m}$ is a mini-batch with $m$ random examples. As illustrated in Fig. 1 (the rightmost panel), the parameters of the neural network are populated with a random sample $\Delta$ drawn from the variational distribution and with that setting one computes the loss function for a particular mini-batch. The resulting forward-pass estimator is unbiased and differentiable with respect to the variational parameters $\upsilon=\left\{{\left({\mu_{i},\sigma_{i}}\right)}\right\}_{i=1}^{p}$ . Consequently, the gradient of this estimator is also unbiased and can be computed in the backward-pass by $\nabla_{\upsilon}L_{n}\left({q}\right)\approx\nicefrac{{n}}{{m}}\sum_{i=1}^{m}\nabla_{\upsilon}\log p\left({y_{i}\mid x_{i},\Delta}\right)$ , where the network parameters $\Delta$ originate from the forward-pass components and are given by $\Delta_{j}=\mu_{j}+\epsilon_{j}\sigma_{j}$ . Thus, stochastic neural networks update the variational mean and variance parameters during gradient descent and use back-propagation for the computation of the gradients with respect to these parameters. At test time, the parameters of the neural architecture are populated with the variational means. In this way, a stochastic neural network injects randomness into network parameters for each mini-batch. As a result, the inferred model can capture parameter uncertainty and is likely to be more stable to parameter perturbations than an equivalent deterministic model. A further regularization effect can be achieved via the Kullback–Leibler divergence term (Eq. 4), discussed in the next section.
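The forward pass described above can be sketched in a few lines. This is a minimal illustration under strong simplifying assumptions: the model is a single softmax layer rather than the architecture from Fig. 1, the data are random placeholders, and all sizes and initial values are arbitrary. Gradients are omitted; in practice they would be obtained by automatic differentiation through the same computation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, k = 1000, 32, 20, 5              # dataset size, mini-batch size, input dim, classes
X = rng.standard_normal((n, d))           # placeholder inputs
y = rng.integers(0, k, size=n)            # placeholder labels

mu = 0.1 * rng.standard_normal((d, k))    # variational means
sigma = 0.05 * np.ones((d, k))            # variational standard deviations


def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def minibatch_log_likelihood(mu, sigma, rng):
    """One-sample estimate of the expected log-likelihood, rescaled by n/m."""
    idx = rng.choice(n, size=m, replace=False)
    eps = rng.standard_normal(mu.shape)   # eps_j ~ N(0, 1)
    delta = mu + eps * sigma              # Delta_j = mu_j + eps_j * sigma_j
    logp = log_softmax(X[idx] @ delta)[np.arange(m), y[idx]]
    return (n / m) * logp.sum()


estimate = minibatch_log_likelihood(mu, sigma, rng)

# At test time the parameters are populated with the variational means.
test_pred = (X @ mu).argmax(axis=1)
```

A fresh `eps` is drawn on every call, so each mini-batch sees a different realization of the network parameters.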

III-A Approximation of Kullback–Leibler Divergence

The Kullback–Leibler divergence term is responsible for regularization (Eq. 4) and it is defined in terms of an analytically intractable integral that is typically approximated by Monte Carlo estimates using samples from the variational distribution [15] or prior specific second order approximations [37, 38]. We propose an approximation scheme based on the Gauss–Hermite quadrature, which independently of the prior distribution used allows for an approximation with a polynomial of arbitrarily high degree. More specifically, variational inference typically relies on Gaussian mean field approximations and this implies that the divergence term can be expressed as a sum of one dimensional integrals with respect to univariate Gaussian measures. Such integrals can be effectively approximated using the Gauss–Hermite quadrature [39], which is a quadrature with the weight function $\exp(-u^{2})$ over the interval $u\in(-\infty,\infty)$ . The following theorem provides a formal specification of the Gauss–Hermite quadrature for univariate functions.

Theorem 2.

[Abramowitz and Stegun, 39] For a univariate function $h$ and an integral

$$\mathcal{J}=\int_{-\infty}^{\infty}\exp\left({-u^{2}}\right)h\left({u}\right)\,\mathrm{d}{u}$$

the Gauss–Hermite approximation of order $s$ satisfies $\mathcal{J}\approx\sum_{i=1}^{s}w_{i}h\left({u_{i}}\right)$ , where $\left\{{u_{i}}\right\}_{i=1}^{s}$ are the roots of the physicists' version of the Hermite polynomial $H_{s}\left({u}\right)=\left({-1}\right)^{s}\exp\left({u^{2}}\right)\frac{\,\mathrm{d}{}^{s}}{\,\mathrm{d}{u^{s}}}\exp\left({-u^{2}}\right)$ and the corresponding weights $\left\{{w_{i}}\right\}_{i=1}^{s}$ are given by $w_{i}=\frac{2^{s-1}s!\sqrt{\pi}}{s^{2}H_{s-1}\left({u_{i}}\right)^{2}}$ .

Such approximations have been studied theoretically, with convergence rates provided for polynomials and functions of limited regularity. More specifically, the Gauss–Hermite approximation of order $s$ is exact and, thus, optimal for all polynomials of degree $2s-1$ or less [39]. For functions $h\in\mathcal{C}^{2s}$ , the error of the Gauss–Hermite quadrature is given by [40]

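$$\mathcal{J}-\sum_{i=1}^{s}w_{i}h\left({u_{i}}\right)=\frac{s!\sqrt{\pi}}{2^{s}\left({2s}\right)!}h^{\left({2s}\right)}\left({\hat{u}}\right),$$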

where $\hat{u}\in\left({-\infty,\infty}\right)$ . Xiang and Bornemann [41] have studied convergence rates of the Gaussian quadrature for functions of limited regularity. The regularity of an integrand is expressed via the decay rate of its expansion coefficients in the basis formed by the Chebyshev polynomials of the first kind. In particular, if the expansion coefficients $a_{i}\in\mathcal{O}(i^{-p-1})$ for some $p>0$ (where $a_{i}$ corresponds to the Chebyshev polynomial of the $i$ -th degree) then the error of the quadrature approximation of order $s$ can be upper bounded by $\mathcal{O}(s^{-p-1})$ for $p>2$ . For $0<p<2$ , on the other hand, the guaranteed convergence rate is slightly slower and can be upper bounded by $\mathcal{O}(s^{-\nicefrac{{3p}}{{2}}})$ . These results can provide theoretically well founded guidelines for selecting the approximation order and quantify the trade-offs between approximation quality and computational costs.
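The quadrature rule from Theorem 2 and its exactness property can be illustrated with NumPy, whose `hermgauss` routine returns nodes and weights for the weight function $\exp(-u^{2})$. The sketch below uses order $s=3$ (an arbitrary choice for illustration), recomputes the weights from the formula in the theorem, and contrasts a monomial of degree $4\leq 2s-1$ with one of degree $6>2s-1$.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermgauss, hermval

s = 3
u, w = hermgauss(s)                       # nodes and weights of the order-s rule

# Weights recomputed as 2^(s-1) s! sqrt(pi) / (s^2 H_{s-1}(u_i)^2).
coef = np.zeros(s)
coef[s - 1] = 1.0                         # coefficient vector selecting H_{s-1}
w_formula = 2 ** (s - 1) * factorial(s) * sqrt(pi) / (s ** 2 * hermval(u, coef) ** 2)

# Degree 4 <= 2s - 1 = 5: the rule reproduces the integral 3 sqrt(pi) / 4.
approx_u4 = np.sum(w * u ** 4)
exact_u4 = 3 * sqrt(pi) / 4

# Degree 6 > 5: the rule is no longer exact; the integral is 15 sqrt(pi) / 8.
approx_u6 = np.sum(w * u ** 6)
exact_u6 = 15 * sqrt(pi) / 8
```

Increasing `s` extends the range of polynomial degrees that are integrated exactly, at the cost of more function evaluations per parameter.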

III-B Illustration of the Approximation Scheme with Two Priors

In [37], it has been argued that log-scale uniform priors provide a theoretical justification for the dropout regularization technique [42] frequently used in the training of neural networks. The Bayesian aspect of that justification has recently been disputed in [43] but the technique can still be viewed as performing penalized log-likelihood estimation with the Kullback–Leibler divergence term acting as regularizer. The prior is given by $p_{r,\mathrm{lsu}}\left({\log\left|{\Delta_{i}}\right|}\right)\propto\mathrm{const},$ or equivalently $p_{r,\mathrm{lsu}}\left({\left|{\Delta_{i}}\right|}\right)\propto\nicefrac{{1}}{{\left|{\Delta_{i}}\right|}}$ , where $\Delta_{i}$ is some network parameter. Two different second order approximations of the Kullback–Leibler divergence between Gaussian mean field posteriors and this prior distribution were provided in [37] and [38]. We propose an alternative Gauss–Hermite approximation, formalized in the following proposition. Just as in [42] and [37], we employ a parametrization of variational Gaussian mean field known as the dropout posterior, with mean parameter $\mu_{j}$ and variance $\sigma_{j}^{2}=\alpha_{j}\mu_{j}^{2}$ specified via a scaling parameter $\alpha_{j}>0$ (for all $1\leq j\leq p$ ).

Proposition 3.

The kl divergence between a Gaussian distribution with the dropout parametrization of variance and a log-scale uniform prior can be approximated by

[TABLE]

where $v_{i}=\sqrt{2\alpha}u_{i}+1$ (for all $1\leq i\leq s$ ) and the $\left\{{u_{i}}\right\}_{i=1}^{s}$ are roots of the Hermite polynomial with corresponding quadrature weights $\left\{{w_{i}}\right\}_{i=1}^{s}$ .

Proof.

From [37, Appendix C], we know that the Kullback–Leibler divergence term is given by

[TABLE]

The expectation with respect to the Gaussian random variable $\epsilon$ can be re-written as

[TABLE]

The result now follows from Theorem 2 by taking $h\left({t}\right)=\log\left|{\sqrt{2\alpha}t+1}\right|$ . ∎
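The quadrature step in this proof can be sketched as follows. The code estimates only the expectation of $\log\left|{\epsilon}\right|$ under $\epsilon\sim\mathcal{N}(1,\alpha)$ using the nodes $v_{i}=\sqrt{2\alpha}u_{i}+1$, and compares it with a plain Monte Carlo average. The values of $\alpha$, the order $s$, and the sample size are illustrative assumptions, not settings from the experiments.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss


def expected_log_abs(alpha, s=20):
    """Gauss-Hermite estimate of E[log|eps|] for eps ~ N(1, alpha)."""
    u, w = hermgauss(s)
    v = np.sqrt(2.0 * alpha) * u + 1.0    # v_i = sqrt(2 alpha) u_i + 1
    return np.sum(w * np.log(np.abs(v))) / np.sqrt(np.pi)


alpha = 0.01                              # illustrative scaling parameter
gh = expected_log_abs(alpha)

# Monte Carlo reference with a large sample.
rng = np.random.default_rng(0)
eps = 1.0 + np.sqrt(alpha) * rng.standard_normal(1_000_000)
mc = np.mean(np.log(np.abs(eps)))
```

One practical caveat, stated here as an observation rather than a result from the paper: the integrand is singular at $v=0$, so for larger $\alpha$ a quadrature node can land close to $u=-\nicefrac{1}{\sqrt{2\alpha}}$ and the estimate may then be sensitive to the chosen order.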

The scale-mixture is another prior distribution frequently used in variational inference, first proposed in [15]. It resembles the so-called spike and slab prior [44, 45, 46] and is given by

[TABLE]

where $\Delta_{i}$ is a parameter of the model (see Eq. 4), $\eta_{1}^{2}$ and $\eta_{2}^{2}$ are prior (variance) hyper-parameters with $\eta_{1}\ll\eta_{2}$ , $\xi$ is the prior mean, and $0\leq\lambda\leq 1$ is the mixture scale. The hyper-parameters of the prior distributions (i.e., $\eta_{1}$ , $\eta_{2}$ , $\lambda$ , and $\xi$ ) are kept fixed during optimization and can be chosen via cross-validation. The first mixture component is chosen such that $\eta_{1}\ll 1$ , which forces many of the variational parameters to concentrate tightly around the prior mean $\xi$ (e.g., around zero for $\xi=0$ ). The second mixture component has higher variance and heavier tails allowing parameters to move further away from the mean. The prior variance hyper-parameters are shared between all the network parameters and this is an important difference compared to approaches based on the spike and slab prior [46, 45, 44], where each model parameter has a different prior variance. The following proposition provides means for approximating the divergence term between a Gaussian mean field variational distribution and this prior function.

Proposition 4.

The kl divergence between a Gaussian distribution with the dropout parametrization of variance and a scale-mixture prior can be approximated by

[TABLE]

where $v_{i}=\left({\sqrt{2\alpha}u_{i}+1}\right)\mu$ and the $\left\{{u_{i}}\right\}_{i=1}^{s}$ are roots of the Hermite polynomial with corresponding quadrature weights $\left\{{w_{i}}\right\}_{i=1}^{s}$ , $\alpha$ and $\mu$ are variational parameters, and $p_{r,\mathrm{sm}}$ is some scale-mixture prior distribution.

Proof.

We can re-write the divergence term as

[TABLE]

where $H\left({q}\right)$ denotes the entropy of the univariate Gaussian distribution given by

[TABLE]

As the entropy of a Gaussian distribution defines an analytically tractable integral [e.g., see 47, 48], we have that the entropy of $q$ is given by

[TABLE]

On the other hand, the expected log-likelihood of the scale-mixture prior can be approximated using the Gauss–Hermite quadrature by observing that

[TABLE]

The result now follows from Theorem 2 by taking $h\left({t}\right)=\log p_{r,\mathrm{sm}}\left({\sqrt{2\alpha\mu^{2}}t+\mu}\right)$ . ∎
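The decomposition used in this proof (negative entropy minus the expected log-prior) can be sketched as below. The prior is written as a two-component Gaussian mixture with a shared mean; which component carries the weight $\lambda$ and all hyper-parameter values are assumptions made for illustration. The check at the end uses the degenerate case $\eta_{1}=\eta_{2}$, where the prior is a single Gaussian and the divergence has a closed form.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss


def log_scale_mixture(x, lam=0.5, eta1=0.05, eta2=1.0, xi=0.0):
    """Log-density of lam * N(xi, eta1^2) + (1 - lam) * N(xi, eta2^2) (assumed form)."""
    def log_normal(sd):
        return -0.5 * np.log(2 * np.pi * sd ** 2) - 0.5 * ((x - xi) / sd) ** 2
    return np.logaddexp(np.log(lam) + log_normal(eta1),
                        np.log1p(-lam) + log_normal(eta2))


def kl_scale_mixture(mu, alpha, s=20, **prior):
    """KL(q || prior) for q = N(mu, alpha * mu^2), via Gauss-Hermite quadrature."""
    u, w = hermgauss(s)
    v = (np.sqrt(2.0 * alpha) * u + 1.0) * mu            # v_i = (sqrt(2 alpha) u_i + 1) mu
    expected_log_prior = np.sum(w * log_scale_mixture(v, **prior)) / np.sqrt(np.pi)
    entropy = 0.5 * np.log(2 * np.pi * np.e * alpha * mu ** 2)
    return -entropy - expected_log_prior


mu, alpha = 0.3, 0.5                                     # illustrative variational parameters
kl_default = kl_scale_mixture(mu, alpha)

# Degenerate check: eta1 = eta2 = 1 gives a standard normal prior.
kl_single = kl_scale_mixture(mu, alpha, eta1=1.0, eta2=1.0)
var = alpha * mu ** 2
kl_closed_form = -0.5 * np.log(var) + 0.5 * (var + mu ** 2) - 0.5
```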

IV Related Work

An alternative to learning a discriminative model with non-adaptive features is to learn these features automatically as part of a neural architecture that takes raw speech as input. In addition to having a more flexible inductive bias such a model would be less susceptible to the information loss that is inherent to waveform compression by means of a projection to a lower dimensional feature space [9, 49]. In particular, a model operating directly in the waveform domain has the potential to exploit local correlations within the signal that are typically discarded when computing Mel-filter bank values [50], as well as the information contained in a sequence of waveform samples without interruptions by frame boundaries characteristic to spectrograms and non-adaptive feature extraction techniques based on frame-based discrete Fourier transforms [51]. As a result of the latter, phonetic events on the boundaries of short frames are typically poorly described by filterbank features.

Whilst speech production embeds redundancies relevant for robustness, there are several challenges when dealing with these highly correlated raw speech inputs. In particular, the high dimensionality of waveform signals typically requires a larger number of parameters compared to standard features and a prolonged training time. Another difficulty is the fact that raw speech is known to be characterized by a large number of variations such as temporal distortion and speaker variability [11, 24]. Acoustic models based on neural networks operating directly in the waveform domain are, thus, likely to over-fit on small and moderately sized datasets without appropriate inductive bias. In this sense our approach, which combines variational inference with Lipschitz continuity of the operator mapping, provides a theoretical underpinning for the design and learning of effective waveform-based acoustic models. Previous work has also resorted to similar techniques for maintaining the balance between dataset size and model complexity. Watanabe et al. [19, 52] have used variational inference for clustering of states in triphone hidden Markov models (hmm) and learning the appropriate number of components in Gaussian mixture models (gmm). In contrast to this, we use variational inference to learn a stochastic convolutional network that models the conditional probability of a triphone state-id given an input waveform frame.

Graves [17] and Braun and Liu [20] have used variational inference to learn a recurrent neural network as part of an end-to-end acoustic model. While the latter approach does not have an explicit kl divergence term characteristic to variational inference, there is a sparsity inducing penalty over the parameters defining standard deviations, which under a suitable prior could be seen as an instance of kl divergence. In both of these works it was observed that parameter uncertainty is correlated with the importance of individual parameters for the speech recognition tasks considered. Similarly, Hu et al. [53] have proposed a Bayesian neural network that allows for learning with more expressive activation functions in the context of multi-layer perceptrons and standard recurrent neural networks. In particular, each hidden layer of the model relies on Bayesian averaging relative to a weight prior when computing the corresponding outputs, and variational inference for dealing with the resulting analytically intractable integrals. A Bayesian approach coupled with variational inference has also been used in [54] for speaker adaptation. The main difference to this line of work is that none of those models operates in the waveform domain; they rely instead on low-dimensional feature spaces generated by fbank or mfcc features. This allows for scalable inference of recurrent models, which is known to be computationally expensive for high dimensional inputs such as waveform signals. Moreover, prior work in speech recognition (to the best of our knowledge) considers variational inference independently of Lipschitz continuity and other design principles that could allow for learning of robust models in small scale settings. Recently, an approach for modulation filter-learning based on an encoder-decoder architecture and variational inference has been considered in [55] and [56].
The encoder takes as input a Mel-spectrogram constructed using speech segments of fixed length and learns its latent representation. The optimization of encoder-decoder parameters is performed using variational inference and the learned filters are then used to generate features that are used as input to an mlp. In contrast to this, we use variational inference to learn filters jointly with other network parameters (i.e., filterbank-based feature extraction/learning is not done independently of training other network modules).

A common characteristic of previous approaches for waveform-based speech recognition is the use of relatively large datasets [11, 12]. In such a regime, waveform-based acoustic models are competitive with architectures relying on standard features (i.e., mfcc, fbank, and fmllr). Another difference compared to our approach is that previous architectures typically employ a convolutional layer with weighted $\ell_{1}$ or $\ell_{2}$ pooling ( $25$ ms long frames) to emulate filterbank features and reduce the dimension of the representation quickly [50, 57]. In contrast to this, we perform gradual compression of the waveform sub-band decomposition via max pooling and thus overcome the information loss inherent in standard features. Moreover, we use the relu non-linearity throughout the network and do not apply the log operator to the outputs of the initial block. Sainath et al. [11] propose an architecture that takes raw speech inputs and applies one-dimensional convolutions first in the time-domain and then the frequency-domain, designed to extract band-pass features from the waveform. The architecture itself is a recurrent net that requires more than $2,000$ hours of training data to match the performance of models with standard features. Similarly, Zhu et al. [12] combine two convolutional layers with recurrent blocks in end-to-end training, requiring more than $2,400$ hours of training data for state-of-the-art results. Ghahremani et al. [24] proposed a feedforward architecture based on a convolutional feature extraction layer, with the outputs of that block passed to a deep time-delay neural network (tdnn). The empirical results indicate that the approach is competitive with mfcc-based architectures on large datasets. It has not been evaluated on noisy speech and it is unclear how well it would generalize from small datasets.

Our architecture performs parametric sub-band decomposition of speech waveforms and it is most closely related to sincnet [23], which employs three 1d convolutional layers on top of the parametric block. sincnet is considered to be the state-of-the-art model for waveform-based speech recognition. A related architecture is sinc2net, which links a parametric convolution block to an mlp [58]. Recently, complex-valued parametric filters have been used to initialize a complex non-parametric convolution block in a deep network for end-to-end speech recognition [59, 60, 61]. In comparison to [59], we show that our approach generalizes better on the small timit dataset. In our experiments, we use the sincnet architecture (code available) as a representative baseline from this class.

Recently, an approach based on concatenation of multiple convolutional blocks was proposed [22], in which convolutional blocks capture different contexts in time and learn band-pass filters that are more expressive than classic Mel-filterbanks, which operate on a single fixed context. The approach was evaluated on both noisy and conversational speech. In our experiments, we compare to this baseline and demonstrate statistically significant improvement on the ami-ihm dataset ( $12\%$ relative).

V Experiments

We evaluate the proposed approach with a series of experiments on three different datasets: timit [62], aurora4 [63], and ami-ihm [64]. In all the experiments (a detailed setup along with the source code can be found in the project repository https://bitbucket.org/doglic/asr/), we train a context dependent hybrid hmm model based on frame labels (i.e., hmm state ids) generated using a triphone model from Kaldi [65] with $25$ ms frames and $10$ ms stride between the successive frames. The data splits (train/validation/test) originate from the Kaldi framework. In the pre-processing step, we assign the Kaldi frame label to the $200$ ms long segment of raw speech centered at an original Kaldi frame (keeping $10$ ms stride between the successive frames of raw speech). To be consistent with our baselines on timit, we generate frame labels using the dnn triphone model and decoding configuration from [23]. For aurora4, on the other hand, we generate frame labels using both gmm and dnn triphone models, relying on the default decoder configuration from Kaldi.
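The pre-processing step can be sketched as follows. This is a hypothetical illustration rather than the code from the project repository: the function name, the $16$ kHz sampling rate, the frame-count rule, and the zero padding at utterance boundaries are all assumptions.

```python
import numpy as np


def raw_speech_segments(waveform, sample_rate=16000, frame_ms=25,
                        stride_ms=10, segment_ms=200):
    """Cut one segment of raw speech per labelled frame, centered on that frame."""
    frame = sample_rate * frame_ms // 1000        # 400 samples at 16 kHz
    stride = sample_rate * stride_ms // 1000      # 160 samples
    segment = sample_rate * segment_ms // 1000    # 3200 samples
    # Assumed frame count: only frames that fit entirely within the utterance.
    num_frames = 1 + (len(waveform) - frame) // stride
    pad = segment // 2
    padded = np.pad(waveform, (pad, pad))         # zero padding at the edges (assumption)
    segments = np.empty((num_frames, segment), dtype=waveform.dtype)
    for i in range(num_frames):
        center = i * stride + frame // 2          # center of the i-th 25 ms frame
        segments[i] = padded[center:center + segment]
    return segments


wave = np.arange(16000, dtype=np.float32)         # one second of placeholder samples
segs = raw_speech_segments(wave)
```

Row `i` of `segs` would then be paired with the label of the `i`-th frame, so consecutive training examples overlap heavily while the labels keep the $10$ ms stride.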

We describe below four sets of experiments. The first aims at demonstrating the impact of particular design choices on the effectiveness of acoustic models. More specifically, our empirical results show that: modulation filter learning can improve the performance of acoustic models in a statistically significant way (subsection A, below), the proposed approximation scheme for the Kullback–Leibler divergence term is generally more effective than previous approaches (subsection B, below), modulation filter learning moves away from the initial solution and converges to different distributions of modulation frequencies for different learning tasks (subsection C, below), and probabilistic parametrization of the neural architecture contributes to a $7.4\%$ relative improvement in the error rate compared to the deterministic one (subsection D, below). The second set of experiments (subsections D and E, below) is aimed at showing that the proposed approach does not over-fit on what is considered to be a small dataset in speech recognition (i.e. timit). Moreover, the results also indicate that a combination of variational inference and Lipschitz continuous architectures for waveform-based speech recognition such as parznets does not require large training datasets to outperform models based on standard filterbank features. The third experiment (subsection E, below) deals with noisy speech and shows that the proposed approach can learn an effective noise robust representation of waveform signals. The fourth and final experiment aims at demonstrating the effectiveness of the proposed approach on conversational speech (i.e., ami-ihm), with approximately $80$ hours of audio. The experiment shows a clear improvement over recently proposed waveform-based approaches ( $12\%$ relative) and a competitive performance relative to filterbank architectures known for their effectiveness on this dataset. 
We also observe that variational inference consistently contributes to an improvement in the error rate compared to the deterministic models.

V-A Can modulation filter learning improve the effectiveness of waveform-based acoustic models?

The goal of this experiment is to demonstrate that filter optimization can be more effective than non-adaptive filtering of speech signals, in a way that is statistically significant. To that end, we train two neural networks with identical architectures (see Fig. 1) using variational inference with the Kullback–Leibler divergence term approximated via the Hermite–Gauss (hg) quadrature: i) a neural network with non-adaptive Parzen filters initialized just as in Mel-frequency coefficients (denoted with mel-filters in Tables I and II), and ii) the joint filter and neural network learning proposed in this work (see adaptive filters, hermite-gauss quad. under log-scale uniform prior vi in Tables I and II). The Parzen filters of the latter adaptive operator are initialized exactly as the non-adaptive ones. To assess whether one method performs statistically significantly better than the other on timit, we perform the paired Welch t-test [66] based on $5$ repetitions of the experiment. The t-test indicates that filter learning is statistically significantly better than non-adaptive filtering at the $90\%$ confidence level. We similarly studied performance on aurora4, which is a much larger dataset than timit and for which repeated training is time consuming and expensive. However, the dataset contains $14$ different test samples and this allows us to employ the Wilcoxon signed rank test [67, 68] to again establish whether one approach is statistically significantly better than the other. The test indicates that filter learning is statistically significantly better than non-adaptive filtering on aurora4 at the $95\%$ confidence level (see, e.g., Table II).

V-B How effective is the Gauss–Hermite approximation scheme?

Having established that modulation filter learning can be significantly better than static filtering, we proceed to show that Hermite–Gauss quadrature is an effective scheme for the approximation of the Kullback–Leibler divergence term acting as a regularizer in variational inference. In particular, we compare the effectiveness of neural networks learned via variational inference and existing strategies for approximation of the Kullback–Leibler divergence term, defined using the log-scale uniform [38] and scale mixture priors [15]. Table I (see squared epanechnikov modulation filters, test sample) provides the results on timit and shows that the approximation based on the Hermite–Gauss quadrature (see hermite-gauss quad. columns) is on average better than existing approximation schemes (see molchanov et al. and mcmc columns). However, the Welch t-test does not show a statistically significant improvement of the Hermite–Gauss quadrature over the alternatives on this dataset. Table II summarizes our results on aurora4 and demonstrates a significant improvement over the baselines when using the Hermite–Gauss quadrature to approximate the Kullback–Leibler divergence term. More specifically, the Wilcoxon signed rank test in the case of the log-scale uniform prior shows that the approximation based on the Hermite–Gauss quadrature is statistically significantly better than the state-of-the-art approximation proposed in [38] at the $95\%$ confidence level.

V-C Do modulation frequencies move away from the initial solution and converge to different distributions for different learning tasks?

The goal of this experiment is to demonstrate that the optimization of modulation filters changes the initial distribution of modulation frequencies and bandwidths. Fig. 2 provides a comparison of kernel density estimators for modulation frequencies and filter bandwidths. From the figure, it is evident that the initial and optimized distributions are quite different for filter bandwidths on both datasets. Moreover, the distributions of modulation frequencies differ between the timit and aurora4 datasets, which might be due to multi-condition training and the various noise conditions characteristic of aurora4.

V-D How does the approach fare relative to state-of-the-art feedforward models on timit?

Table III summarizes our empirical results in comparison to state-of-the-art feedforward architectures on timit. In addition to the lowest obtained error rate (denoted with min), we also report the average result over $5$ simulations. A comparison to previously reported results for waveform-based speech recognition indicates that our approach performs the best on average on this task. Moreover, this is the first such approach that outperforms all the feedforward architectures built on top of standard non-adaptive features. Our results also show that variational inference contributes to a $7.4\%$ relative improvement on this dataset over a deterministic network with identical architecture (see deterministic parznets in Table III). We note here that recent work has reported lower error rates on timit using recurrent nets and statically extracted features. In particular, [72] reports the following error rates for gated recurrent units (gru): li-gru $15.8\%$ and li-gru fmllr $14.8\%$. In the waveform domain with low resources (i.e., small datasets such as timit), recurrent nets perform worse than feedforward models. For instance, our best result on this dataset with recurrent nets in the waveform domain was $18.8\%$, which is significantly worse than the best observed result with parznets (i.e., $16.2\%$). The good performance of models based on fmllr features should not come as a surprise, because that feature extraction technique performs speaker and domain adaptation as well. Our future work will explore recurrent architectures in the waveform domain, combined with regularization mechanisms provided by variational inference.
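Relative improvements such as the ones quoted above are measured against the baseline error rate, not as a difference in percentage points. The helper below is a hypothetical illustration of that arithmetic, applied to the two waveform-domain error rates stated in the text (recurrent nets at $18.8\%$ and parznets at $16.2\%$).

```python
def relative_improvement(baseline_error, new_error):
    """Relative reduction in error, in percent of the baseline error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# 2.6 percentage points absolute, which is roughly 13.8% relative to 18.8.
gap = relative_improvement(18.8, 16.2)
```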

V-E How does the approach fare relative to state-of-the-art feedforward models on aurora4?

aurora4 is a medium vocabulary task based on clean speech from the Wall Street Journal (wsj0) corpus [73]. The clean speech was corrupted by six different noise types at different snrs. The test sets consist of noise corrupted utterances recorded by a primary and a secondary microphone. In Table IV we provide a summary of our results on this dataset relative to state-of-the-art feedforward architectures. The first experiment compares our approach (8 x cnn1d) to the state-of-the-art architecture for waveform-based speech recognition [23, sincnet] and shows a statistically significant [68, 67, Wilcoxon test, $95\%$ confidence] improvement over that baseline. We also compare to a recent approach for modulation filter learning using an encoder-decoder architecture and variational inference [55, 56]. The results again show (with $95\%$ confidence) that the proposed approach is statistically significantly better than the baseline from [55, 56]. Following this, we compare our results to the error rates reported in [21] for $8$- and $10$-layer deep 2d convolutional networks (vdcnn2d) based on statically extracted features using $200$ ms long raw-speech segments (i.e., $17$ fbank frames). This might be an unfair comparison to our approach, because we use the less expressive 1d convolutions in our architecture. Still, the results indicate that the variational parznets architecture with $8$ convolutional layers significantly outperforms the network with $10$ cnn2d layers from [21]. Furthermore, we extend our architecture (Fig. 1) to $10$ convolutional layers by employing time-padding in 1d convolutions to allow for another double convolutional block. The results indicate a further improvement in accuracy as a result of this modification.

Another particularly interesting observation is that the gains of our approach over noisy samples do not come as a result of performance degradation on clean speech. We note here that [21] reports a slightly better error rate with 2d convolutions and fbank features when the context size is increased to $250$ ms (i.e. $21$ frames), in combination with time and frequency padding (wer $8.81\%$ ). Table IV (see dnn alignments) shows that our approach provides a competitive error rate (wer $8.73\%$ ) with smaller context size (i.e., $200$ ms) and less expressive time-padded 1d convolutions. Moreover, a recent approach based on multi-octave convolutions and $15$ such convolutional layers has achieved the error rate of $8.31\%$ on this dataset [74].

In follow-up work [75], we have investigated parznets with 2d convolutional operators coupled with Bernoulli dropout layers (i.e., a special case of stochastic neural networks with the variance parameter fixed over an entire network layer). This approach achieved a word error rate of $7.80\%$, which is the best reported number on this dataset for waveform-based speech recognition. Here, it is important to note that the 1d parznets baselines from [75] employ time-padded convolutions and an extra fully connected layer in the mlp block compared to the neural architecture considered in this paper.

In addition to waveform-based baselines and deep convolutional networks operating with standard non-adaptive features, we have also compared our approach to a junction network [71] coupled with first and second order deep scattering spectrum features (see Table IV, dss + junc. net). The latter is a non-adaptive wavelet-based feature extraction technique [28] that generates features of different orders, with the first order coefficients approximately equal to mfcc, and higher order coefficients recovering information lost at lower levels. Our experiments demonstrate that parznets can outperform this approach, even when it is supplied with utterance level normalization. In parallel with this work, we have also proposed deep scattering power spectrum features [76]. The latter non-adaptive feature extraction technique coupled with the junction neural architecture and utterance level normalization performs on par with parznets (wer $8.83\%$). Given that deep scattering spectrum recovers information lost at lower levels, we hypothesize that this might be yet another indication of the relevance of information loss (characteristic of standard filterbank features) for robustness to standard noise corruptions.

V-F How does the approach fare relative to state-of-the-art raw waveform baselines on ami-ihm?

ami-ihm is a conversational speech dataset with approximately $80$ hours of speech, recorded using individual headset microphones. The alignments were generated using the Kaldi recipe configured with $3,984$ hmm state ids. Table V summarizes our result relative to relevant baselines on this dataset.

We have first compared variational parznets with $8$ and $10$ convolutional layers to two recently published raw waveform approaches for this task: multi-span raw waveform models [22] and sincnet [77]. Our empirical results show that variational parznets advance the state-of-the-art in waveform-based acoustic models on this dataset, with over $12\%$ relative improvement in wer compared to these baselines. Moreover, we also compare to deep time-delay neural networks [tdnn, 78] based on fbank features (considered to be the state-of-the-art feedforward model on this dataset) and show that variational inference coupled with a parznets architecture (10 x cnn1d) can outperform that approach. We note here that we have not used any data augmentation or i-vectors in our experiments; both techniques could be combined with our approach and are known to further improve the accuracy on this dataset.

Finally, we note that our experiments were conducted using a cross entropy (ce) loss function. Experiments using a sequence discriminative approach (lf-mmi) indicate that the wers could be further lowered: Povey et al. [79] indicated that using lf-mmi in place of ce can reduce the error rate by about $10\%$ relative, and more recently a regularised lf-mmi training with significant data augmentation (6x) resulted in a wer of $18.0\%$ on this task [80].

VI Discussion

This section discusses some of the model choices and assumptions made by our approach. We also address the empirical evaluation and the ablation studies that we have performed to discern the effects of individual components of our approach.

The proposed approach employs a variational family of univariate Gaussian distributions, known as the mean field assumption. While such a variational family might be perceived as overly simplistic, recent work [81] has demonstrated that deep Bayesian/stochastic neural networks equipped with univariate Gaussian distributions can build complex covariance structures through multiple layers. The proposed neural architecture combines $8$ - $10$ convolutional layers with multi-layer perceptrons and, thus, provides sufficient depth.

The main reason for selecting the probabilistic formulation of the neural architecture is to enforce the bounded weight property across the network and, thus, allow for learning of a robust acoustic model with a good Lipschitz constant. Variational inference alone, however, is not necessary to guarantee bounded weights across the neural network. That property will depend on the choice of prior function and holds for the Gaussian and scale-mixture priors. For the log-scale uniform prior, Section III-A provides a brief discussion and reference to relevant related work where it has been demonstrated that learning with that prior amounts to performing penalized log-likelihood estimation, with the Kullback–Leibler divergence term responsible for regularization. Moreover, the dropout regularization technique [42] can be theoretically justified as variational inference with the log-scale uniform prior. Hence, the proposed approach exploits means to generalize the most frequently used regularization method for neural networks. Our experiments, however, demonstrate that Gaussian and scale-mixture priors do not provide a good inductive bias for waveform-based acoustic models. Future work will explore the potential of more complex prior functions.

In our ablation study (see Section V-A), we have compared the effectiveness of two identical architectures, one with modulation filter learning and the other with a priori fixed or non-adaptive filters. Our empirical results indicate that filter learning can be statistically significantly more effective than non-adaptive filters. Moreover, Fig. 2 shows that modulation frequencies converge to different distributions for different learning tasks and this is yet another indication that non-adaptive filters do not provide a universally optimal inductive bias. When evaluating the effectiveness of the approach relative to standard features such as fbank and mfcc one should bear in mind that different feature representations require different neural architectures and inductive biases for state-of-the-art results. Moreover, there is a significant difference in the dimension of the inputs to neural networks operating with raw waveforms on the one hand and fbank or mfcc features on the other, because of the aggressive compression performed by the latter. In addition to this, neural networks operating with statically extracted features typically encode more information into the training process by means of speaker and utterance level normalizations, which are known to improve the performance of acoustic models. To make the comparison between different feature representations fair, we have decided to compare our approach to state-of-the-art feedforward architectures operating in low-dimensional feature spaces. Tables III and IV indicate a competitive performance of our approach relative to state-of-the-art baselines based on statically extracted features. Moreover, the approach is more effective than any other waveform-based approach and in this sense advances the state-of-the-art.

We conclude with a reference to the selected filterbank, which is simple to implement and provides the band-pass properties required to establish the Lipschitz continuity of the waveform-based operator mapping. The parametrization allows for an independent control over bandwidth and modulation frequency, which is sufficient to emulate a sub-band decomposition as in standard statically extracted features. In Table III (see raw speech cnn and end-to-end cnn), we have compared to deep convolutional networks that employ modulation filter learning with a standard non-parametric convolutional layer. Our empirical results indicate that the strong inductive bias encoded via a parametric convolutional layer can lead to more effective acoustic models, especially in low-resource settings.
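The independent control over bandwidth and modulation frequency can be pictured with a cosine-modulated, compactly supported window. The sketch below uses a squared Epanechnikov window (the window family named in Table I), but the exact parametrization of Section II-B is not reproduced here: the function names and the arguments `center_hz` and `half_width_s` are hypothetical stand-ins for the learned modulation frequency and bandwidth parameters.

```python
import math


def squared_epanechnikov(u):
    """Compactly supported window: (1 - u^2)^2 on [-1, 1], zero elsewhere."""
    return (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def modulated_filter(num_taps, sample_rate, center_hz, half_width_s):
    """Sample a cosine-modulated window on a symmetric time grid.

    center_hz shifts the pass-band, while half_width_s (the support of the
    window in seconds) controls its width; the two can be set independently.
    """
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        t = (n - mid) / sample_rate  # time in seconds, centred at zero
        window = squared_epanechnikov(t / half_width_s)
        taps.append(window * math.cos(2.0 * math.pi * center_hz * t))
    return taps


# Example: a 129-tap filter at 16 kHz centred on 1 kHz with a 2 ms half-width.
taps = modulated_filter(129, 16000, 1000.0, 0.002)
```

Widening the window in time narrows the pass-band and vice versa, so a learned filterbank of this kind can start from a mel-like layout and then adapt both parameters during training.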

Conclusion

We have outlined a principled framework for learning effective waveform-based acoustic models. The framework combines stochastic variational inference with a Lipschitz continuous architecture/operator that learns to gradually extract relevant features. The approach operates directly in the waveform domain to avoid potential information loss inherent to standard feature extraction techniques such as mfcc and fbank coefficients. In our experiments, the approach outperforms recently proposed architectures for waveform-based speech recognition (e.g., sincnet) as well as relevant deep convolutional networks for learning of robust acoustic models using fbank features [21]. Moreover, our empirical results show that the proposed approach allows for learning of effective acoustic models using relatively small datasets. Our future work will explore the potential of stochastic recurrent architectures operating in the waveform domain as well as different priors that could further improve the inductive bias via the regularization mechanism provided by the Kullback–Leibler divergence term. To the best of our knowledge, this is the first time that a variational approach has achieved results competitive with state-of-the-art on continuous speech recognition.

Appendix A Training Procedure

In all the experiments, the minibatch size was set to $256$ samples. For our deterministically trained baselines, we tried two batch sizes, $256$ and $128$, and report the better of the two error rates in our tables. The feature extraction parameters involving Parzen filters and convolution layers that synthesize features across filtered signals were optimized using the rmsprop algorithm [82] with initial learning rate $0.0008$. The fully connected blocks were optimized using the standard stochastic gradient descent with initial learning rate $0.08$. This combination of optimization algorithms (with all the blocks trained jointly) has been found to be the most effective, confirming the findings in [23]. Alternative algorithms that were tried and found to be too aggressive (providing lower training error but worse generalization) were adam [83], nadam [84] and sgd with momentum. Here, it is important to note that the conclusions of our ablation studies were consistent under changes to the optimization algorithm. The learning rates were halved if at the end of an epoch the relative improvement in validation error was below a specified threshold (e.g., $0.1\%$ for the frame classification error). Moreover, if the validation error degraded then training was continued using the model from the previous epoch (with learning rates again halved). We terminate the training process after at most $25$ epochs or upon observing no improvement in the validation error for $3$ successive epochs.
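The learning rate schedule and stopping rule described above can be replayed on a sequence of validation errors, as in the following sketch. It reflects one reading of the procedure: the function name, the action labels, and the decision to count an unchanged validation error as "no improvement" are our assumptions, and the actual reloading of the previous model is only indicated by the action label.

```python
def replay_schedule(val_errors, lr, threshold=0.001, max_epochs=25, patience=3):
    """Replay validation errors through the halving schedule.

    Returns a list of (epoch, action, learning_rate_after_update) tuples.
    """
    prev = None  # validation error of the model that training continues from
    stale = 0    # number of successive epochs without improvement
    log = []
    for epoch, err in enumerate(val_errors[:max_epochs]):
        if prev is None:
            action = "keep"
            prev = err
        elif err > prev:
            # Degradation: continue from the previous model with half the rate.
            action = "revert+halve"
            lr *= 0.5
            stale += 1
        else:
            relative_gain = (prev - err) / prev
            if relative_gain < threshold:
                action = "halve"
                lr *= 0.5
            else:
                action = "keep"
            stale = 0 if err < prev else stale + 1
            prev = err
        log.append((epoch, action, lr))
        if stale >= patience:
            break  # no improvement for `patience` successive epochs
    return log


# Synthetic validation errors, used only to exercise the rules above.
history = replay_schedule([0.50, 0.40, 0.3999, 0.41, 0.42, 0.43], lr=0.0008)
```

In this replay the third epoch improves by less than $0.1\%$ relative and triggers a halving, and the three degradations that follow each halve the rate again before the patience limit stops training.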

In previous work [85, 38] it was established that, for some priors, stochastic variational inference tends to trim too many parameters in the early stages of the training. To address this issue it was proposed [85] to rescale the Kullback–Leibler regularization term with a hyperparameter $\rho_{t}$ such that $\rho_{t+1}=\min\{1,\rho_{t}+c\}$ with $\rho_{0}=0$ and some constant $0<c<1$ (e.g., $c=0.2$), and where $t$ denotes the epoch number (starting from $t=0$). We followed this heuristic in all of our experiments and observed an improvement in accuracy. Following the findings in [86], we also considered two notions of validation error in our preliminary experiments (omitted here for brevity): classification error of raw-speech frames and entropy regularized log-loss [86]. The empirical results from [86] indicate that the latter error correlates better with the token error rate of continuous speech recognition. Indeed, our best results were obtained using the entropy regularized log-loss as the validation objective. Just as in [15], we observed an improvement in accuracy for models trained using batch-specific importance weighting of the divergence term. However, the cooling schedule proposed in [15, Eq. $9$] was too strong for the datasets considered here because of the much larger number of batches. To address this, we replaced base $2$ proposed in [15] with another constant, computed such that the minimal importance weight is equal to machine precision for $32$-bit floating point arithmetic. In addition to these findings we also observed that in some cases the optimization (overly) focuses on the maximization of the log-likelihood for the already correctly classified speech frames. To mitigate this and ensure that the optimization objective is always bounded, we transformed softmax probabilities (denoted with $p$) by

[TABLE]

with $\kappa$ denoting a small jitter constant (e.g. $\kappa=10^{-8}$ ).
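The warm-up rule $\rho_{t+1}=\min\{1,\rho_{t}+c\}$ for the Kullback–Leibler term, described earlier in this appendix, can be sketched as follows. The function names are hypothetical, and the way the weight enters the objective (negative log-likelihood plus $\rho_{t}$ times the divergence) is the usual form of such a rescaling, stated here as an assumption.

```python
def kl_weights(num_epochs, c=0.2):
    """Return rho_0, ..., rho_{num_epochs-1} with rho_0 = 0 and step c."""
    rho = 0.0
    weights = []
    for _ in range(num_epochs):
        weights.append(rho)
        rho = min(1.0, rho + c)
    return weights


def regularized_objective(neg_log_likelihood, kl_divergence, rho):
    """Training objective with the divergence term rescaled by rho."""
    return neg_log_likelihood + rho * kl_divergence


# With c = 0.2 the divergence term reaches full strength at epoch t = 5.
schedule = kl_weights(8)
```

The first epoch therefore trains without the regularizer, which gives the likelihood term time to shape the weights before the prior starts pruning them.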

Acknowledgments

This work was supported in part by EPSRC grant EP/R012067/1. The authors would also like to thank Steve Renals and Peter Bell for valuable discussions and comments that have improved the manuscript. The Kaldi alignments were generated with the help of Erfan Loweimi and Neethu Joy.

Bibliography

[1] F. Li, A. Trevino, A. Menon, and J. Allen, “A psychoacoustic method for studying the necessary and sufficient perceptual cues of american english fricative consonants in noise,” The Journal of the Acoustical Society of America, vol. 132, 2012.
[2] C. Moore, T. Lee, and F. Theunissen, “Noise-invariant neurons in the avian auditory cortex: Hearing the song in noise,” PLOS Computational Biology, vol. 9, no. 3, 2013.
[3] Z. Tüske, R. Schlüter, and H. Ney, “Acoustic modeling of speech waveform based on multi-resolution, neural network signal processing,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.
[4] J. Bridle and M. Brown, “An experimental automatic word-recognition system,” JSRU, Ruislip, UK, Tech. Rep. 1003, 1974.
[5] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 1980.
[6] L. Alsteris and K. Paliwal, “Further intelligibility results from human listening tests using the short-time phase spectrum,” Speech Communication, vol. 48, 2006.
[7] B. Meyer, M. Wächter, T. Brand, and B. Kollmeier, “Phoneme confusions in human and automatic speech recognition,” INTERSPEECH, 2007.
[8] S. Peters, P. Stubley, and J.-M. Valin, “On the limits of speech recognition in noise,” IEEE International Conference on Acoustics, Speech and Signal Processing, 1999.
