On the Prediction Network Architecture in RNN-T for ASR
Dario Albesano, Jesús Andrés-Ferrer, Nicola Ferri, Puming Zhan

TL;DR
This paper compares four prediction network architectures in RNN-T models for ASR and finds that a new, simple architecture, N-Concat, outperforms the others in the online streaming benchmark, with up to 4.1% relative WER improvement and fewer parameters.
Contribution
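The study compares four types of prediction network for both monotonic and original RNN-T models, on top of a common Conformer encoder, in offline batch-mode and online streaming scenarios. It also introduces the N-Concat prediction network, which improves streaming accuracy while using fewer parameters.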
Findings
N-Concat outperforms other architectures in streaming benchmarks.
Transformer does not always outperform LSTM as a prediction network when paired with a Conformer encoder.
Up to 4.1% relative WER improvement with fewer parameters.
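A relative WER improvement is measured against the baseline WER, not in absolute percentage points. The short sketch below shows the arithmetic; the function name and the WER values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    A positive value means the new system has a lower WER than the baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), NOT results from the paper.
baseline_wer = 10.0
new_wer = 9.59

# The absolute drop is 0.41 points, which is roughly 4.1% relative.
improvement = relative_wer_improvement(baseline_wer, new_wer)
print(f"relative WER improvement: {improvement:.1f}%")
```

Under these made-up numbers, a 4.1% relative gain corresponds to less than half a point of absolute WER, which is why it helps to check which of the two a reported figure refers to.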
Abstract
RNN-T models have gained popularity in the literature and in commercial systems because of their competitiveness and capability of operating in online streaming mode. In this work, we conduct an extensive study comparing several prediction network architectures for both monotonic and original RNN-T models. We compare 4 types of prediction networks based on a common state-of-the-art Conformer encoder and report results obtained on Librispeech and an internal medical conversation data set. Our study covers both offline batch-mode and online streaming scenarios. In contrast to some previous works, our results show that Transformer does not always outperform LSTM when used as prediction network along with Conformer encoder. Inspired by our scoreboard, we propose a new simple prediction network architecture, N-Concat, that outperforms the others in our on-line streaming benchmark.…
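The text above names N-Concat but does not describe its internals. As a rough illustration only, the sketch below assumes a limited-context prediction network that concatenates the embeddings of the last N labels and applies a single linear projection, with no recurrence or attention. This reading of N-Concat, the class name, the blank padding label, and all sizes are assumptions made for the example, not details confirmed by this page.

```python
import random

BLANK_ID = 0  # assumed padding / start-of-sequence label


class NConcatPredictor:
    """Hypothetical limited-context prediction network (assumed reading of N-Concat)."""

    def __init__(self, vocab_size, embed_dim, context_size, out_dim, seed=0):
        if context_size < 1:
            raise ValueError("context_size must be at least 1")
        rng = random.Random(seed)
        self.context_size = context_size
        # One randomly initialised embedding vector per label.
        self.embedding = [
            [rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
            for _ in range(vocab_size)
        ]
        # Linear projection from the concatenated context to out_dim.
        in_dim = context_size * embed_dim
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, label_history):
        # Keep only the last `context_size` labels, left-padded with BLANK_ID.
        context = list(label_history)[-self.context_size:]
        context = [BLANK_ID] * (self.context_size - len(context)) + context
        # Concatenate the embeddings of the context labels into one vector.
        concat = [x for label in context for x in self.embedding[label]]
        # Single linear projection of the concatenated vector.
        return [
            sum(w * x for w, x in zip(row, concat)) + b
            for row, b in zip(self.weight, self.bias)
        ]


predictor = NConcatPredictor(vocab_size=5, embed_dim=3, context_size=2, out_dim=4)
out_a = predictor([1, 2, 3])
out_b = predictor([4, 2, 3])
# Both histories end in labels (2, 3), so the two outputs are identical.
print(out_a == out_b)
```

In this sketch the output depends only on the last N labels, so the parameter count is one embedding table plus one projection matrix. That is one plausible way a prediction network could end up smaller than an LSTM or Transformer, under the stated assumptions.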
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Voice and Speech Disorders
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Residual Connection · Label Smoothing · Dropout · Layer Normalization · Absolute Position Encodings · Dense Connections
