Improved TDNNs using Deep Kernels and Frequency Dependent Grid-RNNs

Florian Kreyssig; Chao Zhang; Philip Woodland

arXiv:1802.06412 · cs.CL · February 21, 2018

TL;DR

This paper improves TDNN acoustic models for speech recognition by deepening the kernel used in the temporal convolutions and adding a frequency-dependent Grid-RNN as a spectro-temporal input stage, which reduces word error rate on broadcast speech data.

Contribution

It introduces a deep TDNN kernel made of three fully connected layers with a residual (ResNet) connection, and pairs it with a newly designed frequency-dependent Grid-RNN that processes the input before the TDNN.

Findings

01

Deep kernel TDNNs reduce WER by 6% relative.

02

Combining deep kernel TDNNs with the frequency-dependent Grid-RNN gives a 9% relative WER reduction.

03

The Grid-RNN strongly outperforms a CNN for spectro-temporal processing when different frequency bands use different sets of parameters.

Abstract

Time delay neural networks (TDNNs) are an effective acoustic model for large vocabulary speech recognition. The strength of the model can be attributed to its ability to effectively model long temporal contexts. However, current TDNN models are relatively shallow, which limits the modelling capability. This paper proposes a method of increasing the network depth by deepening the kernel used in the TDNN temporal convolutions. The best performing kernel consists of three fully connected layers with a residual (ResNet) connection from the output of the first to the output of the third. The addition of spectro-temporal processing as the input to the TDNN in the form of a convolutional neural network (CNN) and a newly designed Grid-RNN was investigated. The Grid-RNN strongly outperforms a CNN if different sets of parameters for different frequency bands are used and can be further enhanced…
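
The deep kernel described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract specifies only three fully connected layers with a residual connection from the output of the first to the output of the third. Everything else here is an assumption made for illustration: the ReLU placement, the layer sizes, the context offsets, the random initialisation, and the names `deep_kernel`, `tdnn_layer`, `affine`, and `init_layer`.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def affine(W, b, x):
    # Fully connected layer without nonlinearity: W @ x + b
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_layer(n_out, n_in, rng):
    # Small random weights and zero bias (illustrative initialisation only)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def deep_kernel(params, x):
    # Three fully connected layers; the output of the first is added to
    # the output of the third (residual connection).
    # ASSUMPTION: ReLU after layers 1 and 2 and after the residual sum.
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(affine(W1, b1, x))
    h2 = relu(affine(W2, b2, h1))
    h3 = affine(W3, b3, h2)
    return relu([a + c for a, c in zip(h1, h3)])


def tdnn_layer(params, frames, offsets):
    # Temporal convolution: at each time step, splice the frames at the
    # given context offsets and apply the same kernel to the spliced vector.
    lo, hi = min(offsets), max(offsets)
    outputs = []
    for t in range(max(0, -lo), len(frames) - max(0, hi)):
        spliced = [v for k in offsets for v in frames[t + k]]
        outputs.append(deep_kernel(params, spliced))
    return outputs


rng = random.Random(0)
feat_dim, hidden = 4, 8          # hypothetical sizes
offsets = (-2, 0, 2)             # hypothetical context offsets
params = [
    init_layer(hidden, feat_dim * len(offsets), rng),
    init_layer(hidden, hidden, rng),
    init_layer(hidden, hidden, rng),
]
frames = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)] for _ in range(10)]
out = tdnn_layer(params, frames, offsets)
print(len(out), len(out[0]))
```

With these offsets the layer only produces an output where the full context is available, so 10 input frames give 6 output frames. A conventional TDNN layer would correspond to replacing `deep_kernel` with a single `relu(affine(...))` call on the spliced vector.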
