Improved TDNNs using Deep Kernels and Frequency Dependent Grid-RNNs
Florian Kreyssig, Chao Zhang, Philip Woodland

TL;DR
This paper improves TDNN acoustic models for speech recognition by deepening the kernels used in the temporal convolutions and by feeding the TDNN from a frequency-dependent Grid-RNN, which reduces word error rates on broadcast speech data.
Contribution
It introduces deep residual kernels in TDNNs and combines them with frequency-dependent Grid-RNNs for improved acoustic modeling.
Findings
Deep kernel TDNNs reduce WER by 6% relative.
Combined with the frequency-dependent Grid-RNN, the relative WER reduction is 9%.
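Grid-RNNs with separate parameters per frequency band strongly outperform a CNN for spectro-temporal processing, and a bi-directional Grid-RNN improves on this further.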
Abstract
Time delay neural networks (TDNNs) are an effective acoustic model for large vocabulary speech recognition. The strength of the model can be attributed to its ability to effectively model long temporal contexts. However, current TDNN models are relatively shallow, which limits the modelling capability. This paper proposes a method of increasing the network depth by deepening the kernel used in the TDNN temporal convolutions. The best performing kernel consists of three fully connected layers with a residual (ResNet) connection from the output of the first to the output of the third. The addition of spectro-temporal processing as the input to the TDNN in the form of a convolutional neural network (CNN) and a newly designed Grid-RNN was investigated. The Grid-RNN strongly outperforms a CNN if different sets of parameters for different frequency bands are used and can be further enhanced…
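The deep kernel described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the ReLU activations, the placement of the final non-linearity after the residual sum, the weight initialisation, the context offsets, and all function names (`make_deep_kernel`, `deep_kernel`, `tdnn_layer`) are assumptions made for the example. Only the overall shape, three fully connected layers with a residual connection from the output of the first to the output of the third, applied as a kernel over spliced temporal context, comes from the abstract.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def linear(weights, bias, v):
    # One fully connected layer: weights is a list of rows, one per output unit.
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    # Assumed initialisation: small uniform weights, zero bias.
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_deep_kernel(rng, n_in, n_hidden):
    # Three fully connected layers; layers 2 and 3 keep the width of layer 1
    # so that the residual sum is well defined.
    return [
        init_layer(rng, n_in, n_hidden),
        init_layer(rng, n_hidden, n_hidden),
        init_layer(rng, n_hidden, n_hidden),
    ]


def deep_kernel(params, spliced):
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(linear(w1, b1, spliced))
    h2 = relu(linear(w2, b2, h1))
    h3 = linear(w3, b3, h2)
    # Residual connection from the output of the first layer
    # to the output of the third (activation placement is assumed).
    return relu([a + b for a, b in zip(h1, h3)])


def tdnn_layer(params, frames, offsets):
    # Temporal convolution: the same kernel is applied at every time step
    # to the concatenation of the frames at the given context offsets.
    lo, hi = min(offsets), max(offsets)
    outputs = []
    for t in range(-lo, len(frames) - hi):
        spliced = [x for d in offsets for x in frames[t + d]]
        outputs.append(deep_kernel(params, spliced))
    return outputs


if __name__ == "__main__":
    rng = random.Random(0)
    frames = [[rng.random() for _ in range(4)] for _ in range(10)]  # 10 frames, 4 features
    offsets = (-2, 0, 2)                                            # hypothetical context
    params = make_deep_kernel(rng, n_in=4 * len(offsets), n_hidden=8)
    out = tdnn_layer(params, frames, offsets)
    print(len(out), len(out[0]))
```

A standard TDNN layer would use a single affine transform plus non-linearity in place of `deep_kernel`; the sketch only changes what is computed per time step, so the temporal context seen by each layer is unchanged while the depth grows.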
