On the Prediction Network Architecture in RNN-T for ASR

Dario Albesano; Jesús Andrés-Ferrer; Nicola Ferri; Puming Zhan

arXiv:2206.14618 · eess.AS · June 30, 2022

TL;DR

This paper compares prediction network architectures for RNN-T ASR models and proposes a simple new architecture, N-Concat, that outperforms the others in the online streaming benchmark. The study reports up to 4.1% relative WER improvement with fewer parameters.

Contribution

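The study compares four types of prediction networks for both monotonic and original RNN-T models, all built on a common Conformer encoder and evaluated in offline batch-mode and online streaming scenarios. It also introduces the N-Concat architecture, which gives the best results in the streaming benchmark.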

Findings

1. N-Concat outperforms other architectures in streaming benchmarks.

2. Transformer does not always outperform LSTM as a prediction network.

3. Up to 4.1% relative WER improvement with fewer parameters.
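
For readers unfamiliar with the metric, the sketch below shows how a relative WER improvement is usually computed: the absolute WER reduction divided by the baseline WER. The function name and the example WER values are hypothetical and chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates expressed in the same unit
    (for example, both in percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a baseline WER of 10.00% reduced to 9.59%
# corresponds to a relative improvement of roughly 4.1%.
improvement = relative_wer_improvement(10.0, 9.59)
print(f"{improvement:.1f}% relative WER improvement")
```

Note that a relative figure says nothing about the absolute WER: the same 4.1% could come from a small or a large baseline error rate.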

Abstract

RNN-T models have gained popularity in the literature and in commercial systems because of their competitiveness and their ability to operate in online streaming mode. In this work, we conduct an extensive study comparing several prediction network architectures for both monotonic and original RNN-T models. We compare 4 types of prediction networks based on a common state-of-the-art Conformer encoder and report results obtained on LibriSpeech and an internal medical conversation data set. Our study covers both offline batch-mode and online streaming scenarios. In contrast to some previous works, our results show that Transformer does not always outperform LSTM when used as a prediction network along with a Conformer encoder. Inspired by our scoreboard, we propose a new simple prediction network architecture, N-Concat, that outperforms the others in our online streaming benchmark.…
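
The abstract as shown here does not spell out the internals of N-Concat. As a rough illustration of what a limited-context prediction network can look like, the sketch below assumes one plausible reading of the name: embed the most recent N labels, concatenate the embeddings, and apply a single linear projection. The class name, parameters, padding rule, and random initialization are all hypothetical and are not taken from the paper.

```python
import random


class ConcatPredictionNetwork:
    """Hypothetical limited-context prediction network (an assumption,
    not the authors' implementation): concatenate the embeddings of the
    last `context_size` labels and project them linearly."""

    def __init__(self, vocab_size, embed_dim, context_size, output_dim,
                 blank_id=0, seed=0):
        if context_size < 1:
            raise ValueError("context_size must be at least 1")
        rng = random.Random(seed)
        self.context_size = context_size
        self.blank_id = blank_id
        # One embedding vector per label in the vocabulary.
        self.embedding = [[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
                          for _ in range(vocab_size)]
        # Projection from the concatenated embeddings to the output size.
        in_dim = context_size * embed_dim
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(output_dim)]
        self.bias = [0.0] * output_dim

    def forward(self, history):
        """Map a label history (list of int ids) to one output vector."""
        # Keep only the most recent labels; left-pad short histories
        # with the blank id (an assumed padding convention).
        context = list(history)[-self.context_size:]
        context = [self.blank_id] * (self.context_size - len(context)) + context
        # Concatenate the embeddings in order, oldest label first.
        concat = [x for label in context for x in self.embedding[label]]
        # Single linear projection, with no recurrence or attention.
        return [sum(w * x for w, x in zip(row, concat)) + b
                for row, b in zip(self.weight, self.bias)]


net = ConcatPredictionNetwork(vocab_size=10, embed_dim=4,
                              context_size=2, output_dim=3)
out = net.forward([5, 1, 2, 3])  # only labels 2 and 3 influence the output
```

Under these assumptions the network has no recurrent state: its output depends only on a fixed window of previous labels, so identical windows always yield identical outputs regardless of the earlier history.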

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Voice and Speech Disorders

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Residual Connection · Label Smoothing · Dropout · Layer Normalization · Absolute Position Encodings · Dense Connections