Fast and accurate factorized neural transducer for text adaption of end-to-end speech recognition models

Rui Zhao; Jian Xue; Partha Parthasarathy; Veljko Miljanic; Jinyu Li

arXiv:2212.01992·cs.CL·February 24, 2023

TL;DR

This paper introduces methods that improve the accuracy and adaptation speed of the factorized neural transducer (FNT) for end-to-end speech recognition, achieving a 9.48% relative word-error-rate reduction over the standard FNT and faster text-only adaptation.

Contribution

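The paper proposes four techniques to improve FNT accuracy and adaptation efficiency: a CTC criterion during training, a KL divergence loss during adaptation, a pre-trained language model to seed the vocabulary predictor, and interpolation of the vocabulary predictor with an n-gram language model.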

Findings

1. Achieved a 9.48% relative WER reduction over standard FNT.

2. Enhanced adaptation speed through n-gram interpolation.

3. Improved text-only adaptation accuracy with proposed methods.

Abstract

The neural transducer is now the most popular end-to-end model for speech recognition because of its natural streaming ability. However, it is challenging to adapt with text-only data. The factorized neural transducer (FNT) model was proposed to mitigate this problem, but its improved ability to adapt on text-only data came at the cost of lower accuracy than the standard neural transducer model. We propose several methods to improve the performance of the FNT model: adding a CTC criterion during training, adding a KL divergence loss during adaptation, using a pre-trained language model to seed the vocabulary predictor, and an efficient adaptation approach that interpolates the vocabulary predictor with an n-gram language model. A combination of these approaches results in a relative word-error-rate reduction of 9.48% from the standard FNT model. Furthermore,…
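
The interpolation idea named in the abstract can be illustrated with a minimal sketch. It assumes linear interpolation in the probability domain with a single fixed weight; the abstract does not state the exact interpolation form, so the function name `interpolate_log_probs`, the `weight` parameter, and the toy distributions below are all hypothetical and are not taken from the paper.

```python
import math


def interpolate_log_probs(lp_vocab, lp_ngram, weight):
    """Combine two next-token distributions given as log-probabilities.

    Assumption (not from the paper): linear interpolation in the
    probability domain, p = (1 - weight) * p_vocab + weight * p_ngram.
    Both inputs must cover the same vocabulary in the same order.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(lp_vocab) != len(lp_ngram):
        raise ValueError("distributions must have the same vocabulary size")

    combined = []
    for lp_v, lp_n in zip(lp_vocab, lp_ngram):
        p = (1.0 - weight) * math.exp(lp_v) + weight * math.exp(lp_n)
        # Guard against log(0) when both models assign zero probability.
        combined.append(math.log(p) if p > 0.0 else float("-inf"))
    return combined


# Toy three-token vocabulary (illustrative values only).
vocab_predictor = [math.log(p) for p in (0.7, 0.2, 0.1)]
ngram_lm = [math.log(p) for p in (0.1, 0.3, 0.6)]
mixed = interpolate_log_probs(vocab_predictor, ngram_lm, weight=0.5)
```

Because an n-gram model can be estimated from adaptation text by counting, a scheme of this shape needs no gradient updates to the vocabulary predictor, which is one plausible reading of why the abstract calls the approach efficient.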

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings