Diagonal State Space Augmented Transformers for Speech Recognition
George Saon, Ankit Gupta, Xiaodong Cui

TL;DR
This paper introduces DSS-augmented transformers for speech recognition, replacing convolutions with diagonal state space models, leading to improved WER on multiple datasets and insights into learned basis functions.
Contribution
The paper proposes a novel DSS-augmented transformer architecture that enhances speech recognition performance over conformers by integrating diagonal state space models.
Findings
Achieved 8.9%/6.7% WER on Switchboard 300/2000 hours.
Improved WER by 7% relative on the MALACH dataset over the previous best published result.
DSS layers learn damped Fourier basis functions.
Table 2: WER (%) of the best single DSSformer model compared with existing approaches on Switchboard 300 hours.

| Work | Model | Encoder | LM | Hub5’00 swb | Hub5’00 ch | Hub5’00 avg | Hub5’01 |
|---|---|---|---|---|---|---|---|
| [14] | AED | Conformer | – | 7.1 | 15.0 | 11.1 | – |
| [8] | AED | Conformer | – | 6.7 | 13.0 | 9.9 | 10.0 |
| | | | LSTM∗ | 5.7 | 11.4 | 8.6 | 8.5 |
| | | | +Trafo∗ | 5.5 | 11.2 | 8.4 | 8.5 |
| [7] | HMM | Conformer | n-gram | 7.1 | 13.5 | 10.3 | 10.4 |
| | | | Trafo | 6.3 | 12.1 | 9.2 | 9.3 |
| [20] | RNN-T | LSTM | – | 6.9 | 14.5 | 10.7 | 11.2 |
| | | | LSTM | 5.9 | 12.5 | 9.2 | 9.4 |
| [30] | RNN-T | Conformer | n-gram | – | – | 10.3 | 10.6 |
| | | | Trafo | – | – | 9.3 | 9.4 |
| Ours | RNN-T | DSSformer | – | 6.7 | 13.4 | 10.0 | 10.3 |
| | | | Trafo | 5.6 | 12.2 | 8.9 | 9.0 |

Table 3: WER (%) for conformer and DSSformer transducers with various initializations on Switchboard+Fisher 2000 hours.

| Encoder | Hub5’00 swb | Hub5’00 ch | Hub5’00 avg | Hub5’01 | RT’03 |
|---|---|---|---|---|---|
| Conformer (10L) | 5.2 | 8.5 | 6.9 | 7.6 | 7.8 |
| Conformer (12L) | 5.4 | 8.5 | 6.9 | 7.6 | 8.2 |
| HiPPO | 5.2 | 8.4 | 6.8 | 7.4 | 7.5 |
| S4D-Lin | 5.3 | 8.4 | 6.8 | 7.6 | 7.5 |
| | 5.1 | 8.5 | 6.8 | 7.4 | 7.4 |
| +length perturb. | 5.2 | 8.2 | 6.7 | 7.2 | 7.5 |
| Conformer AED [8] | 4.8 | 8.0 | 6.4 | 7.3 | 7.5 |
Abstract
We improve on the popular conformer architecture by replacing the depthwise temporal convolutions with diagonal state space (DSS) models. DSS is a recently introduced variant of linear RNNs obtained by discretizing a linear dynamical system with a diagonal state transition matrix. DSS layers project the input sequence onto a space of orthogonal polynomials where the choice of basis functions, metric and support is controlled by the eigenvalues of the transition matrix. We compare neural transducers with either conformer or our proposed DSS-augmented transformer (DSSformer) encoders on three public corpora: Switchboard English conversational telephone speech 300 hours, Switchboard+Fisher 2000 hours, and a spoken archive of Holocaust survivor testimonials called MALACH 176 hours. On Switchboard 300/2000 hours, we reach a single model performance of 8.9%/6.7% WER on the combined test set of the Hub5 2000 evaluation, respectively, and on MALACH we improve the WER by 7% relative over the previous best published result. In addition, we present empirical evidence suggesting that DSS layers learn damped Fourier basis functions where the attenuation coefficients are layer-specific whereas the frequency coefficients converge to almost identical linearly spaced values across all layers.
Index Terms— structured state space models, diagonal state space models, neural transducers, end-to-end ASR
1 Introduction and related work
An interesting alternative to the ubiquitous transformer architecture is the recently introduced structured state space sequence model (S4), which showed promising results for modeling long-range dependencies on the LRA (Long Range Arena) benchmark for sequence-level classification of different modalities such as text, images and mathematical expressions [1]. The main idea behind S4 is that the input sequence can be modeled as a linear RNN obtained by discretizing a continuous state space model. The physical meaning of a state in S4 is a time-varying vector of linear expansion coefficients used to approximate the input sequence with orthogonal polynomials under a given measure and support (weighting function and input window) [2]. The appeal of these models is that they can be efficiently implemented as full sequence convolutions running in $O(L \log L)$ instead of the $O(L^2)$ complexity for self-attention, with $L$ being the input sequence length. Moreover, these models are solidly grounded in function approximation theory and have interpretable parameters in terms of basis functions, measures and time sampling intervals.
In [1] the authors consider a diagonal plus low-rank approximation of the state transition matrix which simplifies the convolutional kernel estimation. In [3], the authors observed that there is no loss in performance when assuming that the transition matrix is diagonal with complex eigenvalues which is conceptually simpler and straightforward to implement compared to [1]. Because of this, diagonal state space (DSS) models will be adopted in this paper. In both works, the authors initialize the diagonal entries of the state transition matrix with the eigenvalues of a higher-order polynomial projection operator (HiPPO) matrix such that the input function is uniformly approximated with Legendre polynomials over a sliding window of fixed length. In [4] the authors argue that parameterizing the eigenvalues in log-space and initializing them with -exp for the real parts and +exp for the imaginary parts is just as effective and improve the DSS model further by augmenting it with self-attention to better capture local dependencies. In [5], the authors revisit the parameterization and initialization of DSS and propose eigenvalue initialization schemes with constant negative real parts with respect to the eigenvalue index and imaginary parts which scale either inversely or linearly with the eigenvalue index. The former results in projecting the input onto the space of Legendre polynomials with uniform weighting from the beginning of the sequence up to the current time whereas the latter amounts to using damped Fourier basis functions as approximators with an exponentially decaying weighted history.
While DSS has been primarily developed as an alternative to self-attention, the dual RNN/convolutional representation suggests that it has potential to outperform the depthwise temporal convolutions in the conformer architecture [6]. We echo the findings of [4] which indicate that self-attention and DSS exhibit complementary behaviour and do not necessarily subsume each other. Given the popularity and effectiveness of conformers for both hybrid [7] and end-to-end ASR [8, 9, 10, 11, 12], several other avenues have been explored in the literature to either improve the conformer architecture or the training recipe. In [13], the authors use grouped self-attention and progressive down-sampling to reduce the complexity of the self-attention layer. In [14], the authors provide training recipes and extensive comparisons between conformers and transformers on several corpora. In [15] the authors replace the transformer layer with performer. In [16], the authors use linear self-attention layers. In [7] the authors use two convolutional layers for each conformer block and layer normalization instead of batch norm. Similar to our work, in [17], the authors replace the convolutional layers with a more powerful representation called ConvNeXt.
The main contributions of this work are summarized below:
- We apply diagonal state space models to speech recognition and report experimental results on three public corpora.
- We show that DSSformers outperform conformers when used as encoders for neural transducers and achieve state-of-the-art results for single non-AED models on Switchboard telephony speech and MALACH.
- We study the effect of DSS initialization and provide some insights into what the DSS layers actually learn.
The rest of the paper is organized as follows: in section 2 we review the DSS formalism; in section 3 we present experimental evidence of its utility and in section 4 we summarize our findings.
2 DSS formulation
We briefly review the main concepts behind the diagonal state spaces framework for readers from the ASR community who may not be familiar with this new sequence-to-sequence modeling approach.
2.1 State space model
Borrowing some definitions and notations from [1, 3], a continuous state space model (SSM), sometimes referred to in the literature as a linear time-invariant or a linear dynamical system, is defined by the linear ODE:
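$$\frac{dx}{dt}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) \tag{1}$$
that maps the continuous 1-dimensional input $u(t) \in \mathbb{R}$ to an $N$-dimensional latent state $x(t) \in \mathbb{R}^{N}$ before projecting it to a 1-dimensional output $y(t) \in \mathbb{R}$. The state space is parameterized by the state transition matrix $A \in \mathbb{R}^{N \times N}$ as well as trainable parameters $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$.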
2.2 Discretization and link to linear RNNs
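Consider a sampling interval $\Delta > 0$ and define $u_k := u(k\Delta)$, $k = 0, \ldots, L-1$, the sampled input signal. Correspondingly, we have $x_k = x(k\Delta)$ and $y_k = y(k\Delta)$. Equation (1) can be turned into a discrete recurrence that maps $(u_0, \ldots, u_{L-1}) \mapsto (y_0, \ldots, y_{L-1})$ by integrating over $[(k-1)\Delta, k\Delta]$ under the zero-order hold (ZOH) assumption $u(t) = u_k$ for $(k-1)\Delta \le t < k\Delta$:
$$x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = \bar{C}\,x_k, \qquad \bar{A} = e^{A\Delta}, \quad \bar{B} = (\bar{A} - I)A^{-1}B, \quad \bar{C} = C \tag{2}$$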
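As a sanity check of the zero-order hold discretization, the sketch below uses a scalar state ($N = 1$), for which $\bar{A} = e^{a\Delta}$ and $\bar{B} = (\bar{A} - 1)\,b/a$, and compares the discrete recurrence with a fine-grained Euler integration of the ODE (1) driven by the piecewise-constant input. The parameter values and function names are illustrative assumptions, not taken from the paper.

```python
import cmath

def zoh_discretize(a, b, delta):
    """Scalar ZOH discretization: returns (a_bar, b_bar)."""
    a_bar = cmath.exp(a * delta)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def run_recurrence(a_bar, b_bar, c, u):
    """x_k = a_bar * x_{k-1} + b_bar * u_k, y_k = c * x_k, starting from x_{-1} = 0."""
    x, ys = 0j, []
    for u_k in u:
        x = a_bar * x + b_bar * u_k
        ys.append(c * x)
    return ys

def run_euler(a, b, c, u, delta, substeps=2000):
    """Integrate x'(t) = a x(t) + b u(t) with u held constant over each interval."""
    x, ys = 0j, []
    h = delta / substeps
    for u_k in u:
        for _ in range(substeps):
            x = x + h * (a * x + b * u_k)
        ys.append(c * x)
    return ys

# Illustrative (assumed) values: one damped, oscillating mode.
a, b, c, delta = complex(-0.5, 3.0), 1.0, 1.0, 0.1
u = [1.0, -2.0, 0.5, 0.0, 3.0]

a_bar, b_bar = zoh_discretize(a, b, delta)
y_rec = run_recurrence(a_bar, b_bar, c, u)
y_ode = run_euler(a, b, c, u, delta)
max_err = max(abs(p - q) for p, q in zip(y_rec, y_ode))
print(max_err)  # small, and shrinks further as substeps grows
```

The recurrence is exact for a piecewise-constant input, so the remaining gap is only the Euler integration error.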
2.3 Convolutional representation
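With the convention $x_{-1} = 0$, the recurrence can be unrolled and rewritten by eliminating the state variables $x_k$:
$$y_k = \sum_{j=0}^{k} \bar{C}\bar{A}^{j}\bar{B}\,u_{k-j}, \qquad k = 0, \ldots, L-1 \tag{3}$$
By grouping the scalar coefficients $\bar{C}\bar{A}^{j}\bar{B}$ into the SSM kernel $\bar{K} = (\bar{C}\bar{B}, \bar{C}\bar{A}\bar{B}, \ldots, \bar{C}\bar{A}^{L-1}\bar{B})$, (3) can be elegantly reformulated as a convolution
$$y_k = \sum_{j=0}^{k} \bar{K}_j\,u_{k-j} \quad\Longleftrightarrow\quad y = \bar{K} * u \tag{4}$$
Computing (4) naively would require $O(L^2)$ operations. Instead, we observe that $y_k$ is the coefficient of $z^k$ of the product $u(z) \cdot \bar{K}(z)$ of two $(L-1)$-degree univariate polynomials $u(z) = \sum_{i=0}^{L-1} u_i z^i$ and $\bar{K}(z) = \sum_{i=0}^{L-1} \bar{K}_i z^i$. By the circular convolution theorem, this product can be computed efficiently in $O(L \log L)$ using FFT and its inverse.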
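The equivalence between the unrolled recurrence, the kernel convolution and the FFT shortcut can be checked numerically. The sketch below assumes a scalar state for brevity and uses a small textbook radix-2 FFT written for this example; the helper names `fft` and `fft_convolve` and all parameter values are illustrative and do not come from the paper or its code.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1.0 if invert else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_convolve(kernel, u):
    """First len(u) coefficients of the polynomial product kernel(z) * u(z)."""
    L = len(u)
    n = 1
    while n < 2 * L - 1:  # zero-pad so the circular convolution does not wrap around
        n *= 2
    fk = fft([complex(v) for v in kernel] + [0j] * (n - len(kernel)))
    fu = fft([complex(v) for v in u] + [0j] * (n - L))
    prod = fft([p * q for p, q in zip(fk, fu)], invert=True)
    return [v / n for v in prod[:L]]

# Scalar discrete SSM with assumed values.
a_bar, b_bar, c_bar = 0.9 * cmath.exp(0.3j), 0.5, 2.0
u = [1.0, 0.0, -1.0, 2.0, 0.5, 0.0, 1.5]
L = len(u)

# (a) Recurrence with x_{-1} = 0.
x, y_rec = 0j, []
for u_k in u:
    x = a_bar * x + b_bar * u_k
    y_rec.append(c_bar * x)

# (b) Kernel and naive O(L^2) causal convolution.
kernel = [c_bar * a_bar ** k * b_bar for k in range(L)]
y_naive = [sum(kernel[j] * u[k - j] for j in range(k + 1)) for k in range(L)]

# (c) Same convolution through the FFT.
y_fft = fft_convolve(kernel, u)

err_naive = max(abs(p - q) for p, q in zip(y_rec, y_naive))
err_fft = max(abs(p - q) for p, q in zip(y_rec, y_fft))
print(err_naive, err_fft)  # both at floating-point noise level
```

In practice a library FFT would replace the hand-written one; the padding to at least $2L-1$ points is what turns the circular convolution into the causal one in (4).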
2.4 Diagonal state spaces
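Based on the above, computing $y$ from $u$ and $\bar{K}$ is easy; the hard part is how to compute $\bar{K}$ efficiently. The main result in [3] states that if $A \in \mathbb{C}^{N \times N}$ is diagonalizable over $\mathbb{C}$ with eigenvalues $\lambda_1, \ldots, \lambda_N$ such that, $\forall i$, $\lambda_i \neq 0$ and $e^{L\lambda_i\Delta} \neq 1$, there $\exists\, w \in \mathbb{C}^{1 \times N}$ such that
$$\bar{K} = w \cdot \Lambda^{-1} \cdot \text{row-softmax}(P) \tag{5}$$
where $P \in \mathbb{C}^{N \times L}$, $P_{ik} = \lambda_i k \Delta$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$. The proof uses the diagonalization of $A = V \Lambda V^{-1}$ which, from the expression of $\bar{A}$ from (2), implies $\bar{A}^k = V e^{k\Delta\Lambda} V^{-1}$, and the geometric series identity $\sum_{k=0}^{L-1} z^k = \frac{z^L - 1}{z - 1}$. We refer the reader to [3] for the complete proof.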
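To make the row-softmax form of the kernel from [3] concrete, the sketch below assumes a diagonal state space with $B$ equal to the all-ones vector and compares the kernel computed directly from the discretized model, $\bar{K}_k = \sum_i C_i \bar{B}_i \bar{A}_i^k$, with $w \cdot \Lambda^{-1} \cdot \text{row-softmax}(P)$, $P_{ik} = \lambda_i k \Delta$. The matching choice $w_i = C_i\,(e^{L\lambda_i\Delta} - 1)$ follows from the geometric series identity. The eigenvalues, coefficients and helper names are illustrative assumptions, and the plain (non-stabilized) complex softmax is used only for readability.

```python
import cmath

def kernel_direct(lams, c, delta, L):
    """K_k = sum_i c_i * b_bar_i * a_bar_i**k for diagonal A and B = (1, ..., 1)."""
    out = []
    for k in range(L):
        total = 0j
        for lam, c_i in zip(lams, c):
            a_bar = cmath.exp(lam * delta)
            b_bar = (a_bar - 1.0) / lam
            total += c_i * b_bar * a_bar ** k
        out.append(total)
    return out

def kernel_softmax(lams, w, delta, L):
    """K = w . Lambda^{-1} . row-softmax(P) with P[i][k] = lam_i * k * delta."""
    out = [0j] * L
    for lam, w_i in zip(lams, w):
        row = [cmath.exp(lam * k * delta) for k in range(L)]
        norm = sum(row)  # nonzero whenever exp(L * lam * delta) != 1
        for k in range(L):
            out[k] += w_i / lam * row[k] / norm
    return out

# Assumed toy configuration.
N, L, delta = 4, 32, 0.05
lams = [complex(-0.5, cmath.pi * n) for n in range(N)]
c = [complex(1.0, 0.5), complex(-0.3, 0.2), complex(0.7, -1.0), complex(0.1, 0.1)]
w = [c_i * (cmath.exp(L * lam * delta) - 1.0) for lam, c_i in zip(lams, c)]

k_direct = kernel_direct(lams, c, delta, L)
k_softmax = kernel_softmax(lams, w, delta, L)
max_diff = max(abs(p - q) for p, q in zip(k_direct, k_softmax))
print(max_diff)  # floating-point noise level
```

The two conditions on the eigenvalues show up directly in the code: $\lambda_i \neq 0$ keeps the division by `lam` well defined, and $e^{L\lambda_i\Delta} \neq 1$ keeps the softmax normalizer away from zero.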
2.5 DSS layer
A DSS layer operates as follows. It receives an input sequence $u \in \mathbb{R}^{H \times L}$ and produces an output sequence $y \in \mathbb{R}^{H \times L}$ where $H$ is the number of channels and $L$ is the sequence length. It does this by applying $H$ DSS kernels to the input (with a shortcut connection) according to (4), one for each coordinate. We apply a Gaussian Error Linear Unit (GELU) nonlinearity to the result followed by an $H \times H$ pointwise linear layer needed to exchange information between the dimensions. After mixing, we apply a Gated Linear Unit (GLU) activation to the output. The implementation of a DSS layer as described so far is publicly available at https://github.com/ag1988/dss.
For a state space dimension $N$, the trainable parameters of the DSS layer are: $\Lambda$, the diagonal entries of the transition matrix (tied across all channels), $w$ from (5), $\Delta$, the time sampling intervals, and the output mixing matrix.
Just like the depthwise separable convolution module in the conformer architecture, the DSS layer is sandwiched between two pointwise convolutions which serve to increase the inner dimension (typically by a factor of 2) on which the layer operates as shown in Figure 1.
3 Experiments and results
We investigate the effectiveness of the proposed model on three public corpora: Switchboard English conversational telephone speech 300 hours, Switchboard+Fisher 2000 hours, and MALACH 176 hours.
3.1 Experiments on Switchboard 300 hours
The acoustic training data comprises 300 hours of English telephone conversations between two strangers on a preassigned topic. We follow the Kaldi s5c recipe [18] for data preparation and segmentation and report results on the Hub5 2000 (Switchboard and CallHome), Hub5 2001 and RT’03 test sets which are processed according to the LDC segmentation and scored using Kaldi WER measurement.
3.1.1 Feature processing
Our feature extraction and training recipe largely mirrors [19] with some notable differences. We extract 40-dimensional speaker independent log-Mel features every 10ms with speaker-based mean and variance normalization augmented with $\Delta$ and $\Delta\Delta$ coefficients. We perform temporal subsampling by a factor of 2 by stacking every two consecutive frames and skipping every second stacked frame, which results in 50 feature vectors per second, each of dimension 240. Unlike [19, 20, 7], we do not use appended i-vectors as we found them to be less effective with conformer transducers. We create 4 additional replicas of the training data using speed and tempo perturbation [21], both with values in $\{0.9, 1.1\}$, which, together with the original data, amounts to 1500 hours of training data every epoch. We perturb the data in three different ways: (i) sequence noise injection adds, with probability 0.8, a down-scaled spectrum of a random utterance to the current utterance [22]; (ii) SpecAugment randomly masks blocks in both time and frequency with the settings from [23]; (iii) length perturbation randomly deletes and inserts contiguous frames with probability 0.7 [20].
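The frame-rate arithmetic can be illustrated with a short sketch: 40 log-Mel features plus first and second order difference coefficients give 120 dimensions per 10 ms frame, and stacking pairs of consecutive frames while keeping every second stacked frame yields 240-dimensional vectors at half the frame rate. The function name and the dummy features are assumptions made for illustration.

```python
def stack_and_skip(frames):
    """Concatenate each frame with its successor, then keep every second stacked frame."""
    stacked = [frames[t] + frames[t + 1] for t in range(len(frames) - 1)]
    return stacked[::2]

# One second of dummy 120-dimensional features at 100 frames per second.
frames = [[float(t)] * 120 for t in range(100)]
subsampled = stack_and_skip(frames)
print(len(subsampled), len(subsampled[0]))  # 50 240
```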
3.1.2 Transducer architecture
We trained neural transducers (or RNN-Ts; both terms are used interchangeably in the literature even for models where the encoder is not an RNN) [24] with either conformer or DSSformer encoders with 12 layers, feed-forward dimension of 384 and 6×96-dimensional attention heads for an inner dimension of 512. All DSS layers use bidirectional kernels with a state space dimension $N$=96. The joint network projects the 384-dim vectors from the last encoder layer to 256 and multiplies the result elementwise [19, 25] with a 256-dim projection of a label embedding computed by a unidirectional 1024-cell LSTM prediction network. After the application of hyperbolic tangent, the output is projected to 46 logits followed by a softmax layer corresponding to 45 characters plus BLANK. The baseline conformer RNN-T has an optimal size of 63M parameters and the DSSformer RNN-T has 73M parameters.
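The data flow through the multiplicative joint network can be sketched for a single (encoder frame, prediction state) pair. This is a minimal sketch with random weights, no bias terms and hypothetical helper names; only the dimensions (384, 1024, 256, 46) and the order of operations are taken from the description above.

```python
import math
import random

def linear(weights, vec):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def rand_matrix(rows, cols, rng, scale=0.05):
    """Random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def joint(enc_vec, pred_vec, w_enc, w_pred, w_out):
    """Project both inputs to 256, multiply elementwise, apply tanh, project to 46, softmax."""
    h_enc = linear(w_enc, enc_vec)      # 384 -> 256
    h_pred = linear(w_pred, pred_vec)   # 1024 -> 256
    hidden = [math.tanh(a * b) for a, b in zip(h_enc, h_pred)]
    return softmax(linear(w_out, hidden))  # 256 -> 46 (45 characters + BLANK)

rng = random.Random(0)
w_enc = rand_matrix(256, 384, rng)
w_pred = rand_matrix(256, 1024, rng)
w_out = rand_matrix(46, 256, rng)
enc_vec = [rng.gauss(0.0, 1.0) for _ in range(384)]    # one encoder output frame
pred_vec = [rng.gauss(0.0, 1.0) for _ in range(1024)]  # one prediction-network state
probs = joint(enc_vec, pred_vec, w_enc, w_pred, w_out)
print(len(probs), sum(probs))
```

In a full transducer this function is evaluated on every point of the (time, label) lattice, which is why the joint network is kept small relative to the encoder.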
3.1.3 Training and decoding
The models were trained in PyTorch to minimize the RNN-T loss with CTC loss smoothing from the encoder with a weight of 0.1. Training was carried out on single A100 GPUs for 24 epochs with AdamW SGD and a one cycle learning rate policy which ramps up the step size linearly from 5e-5 to 5e-4 for the first 8 epochs followed by a linear annealing phase to 0 for the remaining 16 epochs. All experiments use a batch size of 64 utterances. Decoding was done using alignment-length synchronous beam search [26]. We also report results with density ratio shallow language model fusion [27] where the target LM is a 12-layer, 512-dimensional transformerXL character LM [28] trained on the Switchboard+Fisher acoustic transcripts (126M characters) and the source LM has the same configuration as the prediction network and was trained on the 300 hours transcripts only (15M characters).
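The learning rate policy described above can be written as a small function of the (possibly fractional) epoch index. This is a sketch of the schedule shape only; the function name is hypothetical and whether the rate is updated per step or per epoch is an assumption, since the text does not say.

```python
def one_cycle_lr(epoch, warmup_epochs=8.0, total_epochs=24.0, lr_start=5e-5, lr_peak=5e-4):
    """Linear ramp from lr_start to lr_peak, then linear annealing to 0."""
    if epoch <= warmup_epochs:
        return lr_start + (lr_peak - lr_start) * epoch / warmup_epochs
    remaining = (total_epochs - epoch) / (total_epochs - warmup_epochs)
    return lr_peak * max(remaining, 0.0)

for e in (0, 4, 8, 16, 24):
    print(e, one_cycle_lr(e))
```

The rate peaks at 5e-4 at the end of epoch 8, is halfway down (2.5e-4) at epoch 16 and reaches 0 at epoch 24.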
3.1.4 DSS layer initialization and recognition results
In Table 1, we compare the performance of baseline conformer and DSSformer transducers with different initializations of the $\Lambda$ matrix. Concretely, HiPPO uses the top $N$ eigenvalues with positive imaginary part from the skew-symmetric matrix
$$a_{ij}=\begin{cases}2(i+1)^{1/2}(2j+1)^{1/2}, & i<j\\ -1/2, & i=j\\ -2(i+1)^{1/2}(2j+1)^{1/2}, & i>j\end{cases}$$
[3]. For exp random, where [4]. For S4D-Inv, $\lambda_n = -\frac{1}{2} + i\,\frac{N}{\pi}\left(\frac{N}{2n+1} - 1\right)$, whereas for S4D-Lin, $\lambda_n = -\frac{1}{2} + i\pi n$ [5]. For all experiments, $\Delta$ is parameterized in log-space with values drawn from and the real and imaginary parts for $w$ in (5) are initialized from .
The initialization from the last row in Table 1 was motivated by inspecting the converged values of $\Lambda$ when the DSS layers were initialized with S4D-Lin. Interestingly, the imaginary parts of $\Lambda$ converge from to approximately across all layers as shown in Figure 2(b). In contrast, in Figure 2(a) the real parts converge to values that are layer-dependent (the curves have been smoothed with Bezier interpolation for ease of visualization). This suggests that the DSS layers learn damped Fourier basis functions where the attenuation coefficients are layer-specific and the frequency coefficients are linearly spaced and common across layers. The benefit of using FFT layers for mixing input sequences has also been shown in the FNet architecture [29].
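The "damped Fourier basis" reading follows from the form of the kernel entries: an eigenvalue $\lambda = -a + ib$ contributes $e^{\lambda k \Delta} = e^{-ak\Delta}\,(\cos(bk\Delta) + i\sin(bk\Delta))$, a sinusoid of frequency $b$ attenuated at rate $a$. The sketch below evaluates a few such functions with linearly spaced frequencies $b = \pi n$, mirroring the S4D-Lin scheme of [5]; the values of $a$, $\Delta$ and $L$ are illustrative and are not the learned values reported in the paper.

```python
import cmath
import math

def basis_function(a, b, delta, L):
    """Samples of exp(lambda * k * delta) for lambda = -a + i*b, k = 0..L-1."""
    lam = complex(-a, b)
    return [cmath.exp(lam * k * delta) for k in range(L)]

a, delta, L = 0.5, 0.01, 200  # assumed attenuation, step size and length
k = L - 1
for n in range(3):
    b = math.pi * n  # linearly spaced frequencies
    phi = basis_function(a, b, delta, L)
    envelope = math.exp(-a * k * delta)
    expected = envelope * complex(math.cos(b * k * delta), math.sin(b * k * delta))
    # The magnitude follows the exponential envelope; the phase rotates at frequency b.
    print(n, abs(phi[k]), envelope, abs(phi[k] - expected))
```

Under this view, a layer-specific real part changes only how quickly each layer forgets its input history, while shared imaginary parts mean all layers analyze that history at the same set of frequencies.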
In Table 2 we compare the performance of our best single DSSformer model with existing approaches from the literature. Here, the model was trained for 30 epochs with length perturbation with the following settings from [20]: insertion and deletion probabilities of 0.7, 10% of frames selected as starting points for both, maximum deletion length of 7 frames and maximum insertion length of 3 frames. Length perturbation is lifted after 25 epochs.
3.2 Experiments on Switchboard+Fisher 2000 hours
The second set of experiments was carried out on 1975 hours (9875 hours after augmentation) comprising 262 hours of Switchboard 1 audio with segmentations and transcripts provided by Mississippi State University, 1698 hours from the Fisher data collection with transcripts provided by LDC, and 15 hours of CallHome audio. We trained neural transducers with either conformer (10 or 12 layers) or DSSformer encoders (10 layers), feed-forward dimension of 512 and 8×64-dimensional attention heads. All DSS layers use bidirectional kernels with a state space dimension $N$=96. Training was carried out on 4 A100 GPUs with an effective batch size of 128 for 20 epochs with a one cycle LR policy with a maximum learning rate of 5e-4. The other settings are the same as in 3.1. In Table 3 we show a comparison of baseline conformer and DSSformer transducers with various initializations. As can be seen, DSSformer encoders outperform the conformer counterparts and the best initialization is the same as in 3.1. For contrast, we also compare our results with the single best performing model on this task from [8] and note that we achieve a comparable performance on two out of three test sets.
3.3 Experiments on MALACH 176 hours
Lastly, we test the proposed models on the public MALACH corpus [31] (released by LDC as LDC2019S11) which consists of Holocaust testimonies collected by the Survivors of the Shoah Visual History Foundation. The corpus is 16kHz audio broken down into 674 conversations totaling 176 hours for training (880 hours after augmentation) and 8 conversations of 3.1 hours for testing. The collection consists of unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, un-cued speaker and language switching, and emotional speech, all of which present significant challenges for current ASR systems. Because of this, the error rates reported are significantly higher than for the previous corpora. We trained conformer and DSSformer transducers with the same feature extraction, architecture and training recipe as in 3.1, without length perturbation and with S4D-Lin initialization of the DSS layers. In Table 4 we report results with and without external LM fusion where the LM is a 10-layer, 512-dimensional transformerXL trained on 7.2M characters. Our results show a 7% relative improvement in WER over the previous best hybrid LSTM approach.
4 Discussion
Diagonal state space models are a promising alternative to temporal convolutions with fixed-length kernels for ASR when used in a conformer-style architecture. We attribute their success to the connection with function approximation theory and to the interpretability of their parameters. In future work we will investigate better ways of integrating DSS layers with self-attention and feedforward modules as opposed to simply using them as a drop-in replacement for the depthwise convolutions in conformer. For example, the DSS mixing matrix can be combined with the second pointwise convolution which will simplify the overall architecture. Another avenue of research is to improve the initialization for the real parts of the eigenvalues of the state transition matrices and possibly keep the s fixed during training which will reduce the number of free parameters. Lastly, we plan to study the effectiveness of DSS for other end-to-end ASR modeling approaches.
References
- [1] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint, vol. abs/2111.00396, 2021.
- [2] A. Gu, T. Dao, S. Ermon, et al., “HiPPO: Recurrent memory with optimal polynomial projections,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, et al., Eds., 2020.
- [3] A. Gupta, “Diagonal state spaces are as effective as structured state spaces,” arXiv preprint, vol. abs/2203.14343, 2022.
- [4] H. Mehta, A. Gupta, A. Cutkosky, and B. Neyshabur, “Long range language modeling via gated state spaces,” arXiv preprint, vol. abs/2206.13947, 2022.
- [5] A. Gu, A. Gupta, K. Goel, and C. Ré, “On the parameterization and initialization of diagonal state space models,” arXiv preprint, vol. abs/2206.11893, 2022.
- [6] A. Gulati, J. Qin, C. Chiu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, H. Meng, B. Xu, and T. F. Zheng, Eds. 2020, pp. 5036–5040, ISCA.
- [7] M. Zeineldeen, J. Xu, C. Lüscher, et al., “Improving the training recipe for a robust conformer-based hybrid model,” arXiv preprint, vol. abs/2206.12955, 2022.
- [8] Z. Tüske, G. Saon, and B. Kingsbury, “On the limit of English conversational speech recognition,” in Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, H. Hermansky, H. Cernocký, L. Burget, et al., Eds. 2021, pp. 2062–2066, ISCA.
