Dense Prediction on Sequences with Time-Dilated Convolutions for Speech Recognition
Tom Sercu, Vaibhava Goel

TL;DR
This paper adapts dense prediction techniques from computer vision to speech recognition and introduces time-dilated convolutions for efficient framewise sequence labeling, reporting state-of-the-art results on the Hub5 Switchboard benchmark.
Contribution
The paper introduces time-dilated convolutions for sequence prediction, which enable efficient dense prediction in speech recognition and improve recognition performance.
Findings
Achieved 7.7% word error rate (WER) on Hub5 Switchboard-2000 with a single model.
Demonstrated the effectiveness of dense prediction and batch normalization in speech tasks.
Proposed an asymmetric dilated convolution for efficient temporal pooling.
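To make the asymmetric dilation in the last finding concrete, here is a minimal pure-Python sketch. It assumes that "asymmetric" means the kernel is dilated along the time axis only, while the frequency axis keeps contiguous taps. The function name `time_dilated_conv2d`, the time-by-frequency list layout, and the toy input are illustrative choices, not taken from the paper.

```python
def time_dilated_conv2d(x, w, time_dilation):
    """Valid 2-D cross-correlation over a time x frequency map.

    x: list of T rows (time), each with F values (frequency).
    w: kernel given as kt rows (time taps) of kf values (frequency taps).
    The dilation is applied along time only (assumption, see lead-in).
    """
    n_time, n_freq = len(x), len(x[0])
    kt, kf = len(w), len(w[0])
    t_out = n_time - (kt - 1) * time_dilation  # time taps are spread apart
    f_out = n_freq - kf + 1                    # frequency taps stay adjacent
    return [
        [
            sum(
                w[a][b] * x[t + a * time_dilation][f + b]
                for a in range(kt)
                for b in range(kf)
            )
            for f in range(f_out)
        ]
        for t in range(t_out)
    ]


# Toy input: 8 time steps x 5 frequency bins, value encodes its position.
x = [[10 * t + f for f in range(5)] for t in range(8)]
# Kernel that adds the top-left tap and the bottom-right tap.
w = [[1, 0, 0],
     [0, 0, 0],
     [0, 0, 1]]

plain = time_dilated_conv2d(x, w, time_dilation=1)
dilated = time_dilated_conv2d(x, w, time_dilation=2)
print(len(plain), len(plain[0]))      # output size without dilation
print(len(dilated), len(dilated[0]))  # output size with time dilation 2
```

With a dilation of 2, the three time taps cover five input frames, so the output loses four time steps instead of two, while the frequency extent shrinks exactly as in the undilated case.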
Abstract
In computer vision, pixelwise dense prediction is the task of predicting a label for each pixel in the image. Convolutional neural networks achieve good performance on this task, while being computationally efficient. In this paper we carry these ideas over to the problem of assigning a sequence of labels to a set of speech frames, a task commonly known as framewise classification. We show that the dense prediction view of framewise classification offers several advantages and insights, including computational efficiency and the ability to apply batch normalization. When doing dense prediction we pay specific attention to strided pooling in time and introduce an asymmetric dilated convolution, called time-dilated convolution, that allows for efficient and elegant implementation of pooling in time. We show results using time-dilated convolutions in a very deep VGG-style CNN with batch…
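The link between strided pooling in time and dilation can be illustrated with a small 1-D sketch. It compares a toy network evaluated once per context window (with stride-2 pooling) against a single dense pass over the whole sequence, in which the pooling stride is reduced to 1 and the following convolution is dilated by 2. The layer sizes, the 8-frame window, and all function names are assumptions made for this illustration; the paper's actual architecture is a much deeper 2-D CNN.

```python
import random


def conv1d(x, w, dilation=1):
    """Valid 1-D cross-correlation with optional dilation."""
    span = (len(w) - 1) * dilation
    return [
        sum(w[j] * x[i + j * dilation] for j in range(len(w)))
        for i in range(len(x) - span)
    ]


def maxpool1d(x, size, stride):
    """1-D max pooling without padding."""
    n_out = (len(x) - size) // stride + 1
    return [max(x[i * stride:i * stride + size]) for i in range(n_out)]


def window_net(window, w1, w2):
    """Per-window evaluation: conv -> stride-2 pooling -> conv, one output."""
    h = conv1d(window, w1)
    h = maxpool1d(h, size=2, stride=2)
    return conv1d(h, w2)[0]


def dense_net(x, w1, w2):
    """Dense evaluation: stride-1 pooling, then a conv dilated by the old stride."""
    h = conv1d(x, w1)
    h = maxpool1d(h, size=2, stride=1)
    return conv1d(h, w2, dilation=2)


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(20)]   # toy "speech frames"
w1 = [random.gauss(0.0, 1.0) for _ in range(3)]
w2 = [random.gauss(0.0, 1.0) for _ in range(3)]
WINDOW = 8  # context size for which window_net yields exactly one output

per_window = [
    window_net(x[t:t + WINDOW], w1, w2) for t in range(len(x) - WINDOW + 1)
]
dense = dense_net(x, w1, w2)
max_diff = max(abs(a - b) for a, b in zip(per_window, dense))
print(len(per_window), len(dense), max_diff)
```

Both paths produce one prediction per window position and should agree up to floating-point rounding, but the dense path computes the first convolution once for the whole sequence instead of once per overlapping window, which is where the efficiency gain comes from.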
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Batch Normalization
