On Addressing Practical Challenges for RNN-Transducer
Rui Zhao, Jian Xue, Jinyu Li, Wenning Wei, Lei He, Yifan Gong

TL;DR
This paper presents practical solutions for deploying RNN-Transducer speech recognition systems: domain adaptation without audio data, plus word-level time stamps and confidence scores, validated on Microsoft production data.
Contribution
It introduces a splicing data method for domain adaptation, a phone prediction branch that shares the RNN-T encoder to obtain time stamps through forced alignment, and word-level confidence scoring based on features from decoding and from the confusion network.
Findings
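The splicing data method reduces word error rate by 58.03% over the baseline and by 15.25% over adaptation with text-to-speech data.
Word-level time stamping achieves an average error of less than 50 ms.
Word-level confidence scoring achieves high annotation performance at low computational cost.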
Abstract
This paper proposes several methods to address practical challenges in deploying RNN Transducer (RNN-T) based speech recognition systems. These challenges are adapting a well-trained RNN-T model to a new domain without collecting audio data, and obtaining time stamps and confidence scores at the word level. The first challenge is solved with a splicing data method, which concatenates speech segments extracted from the source domain data. To get time stamps, a phone prediction branch is added to the RNN-T model, sharing the encoder, for the purpose of forced alignment. Finally, we obtain word-level confidence scores by utilizing several types of features calculated during decoding and from the confusion network. Evaluated with Microsoft production data, the splicing data adaptation method improves the baseline and adaptation with the text-to-speech method by 58.03% and 15.25%…
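The splicing idea described in the abstract can be illustrated with a minimal sketch. The abstract only states that speech segments extracted from source domain data are concatenated; everything else here is an assumption made for illustration: the word-level segment granularity, the random choice among candidate segments, the rule of skipping sentences with uncovered words, and the hypothetical names `build_segment_bank` and `splice_sentence`. Audio is represented as plain lists of floats rather than real waveforms.

```python
import random
from collections import defaultdict


def build_segment_bank(utterances):
    """Map each word to the audio segments where it occurs in source-domain data.

    Each utterance is (samples, alignment), where alignment is a list of
    (word, start_index, end_index) tuples. This layout is hypothetical.
    """
    bank = defaultdict(list)
    for samples, alignment in utterances:
        for word, start, end in alignment:
            bank[word].append(samples[start:end])
    return bank


def splice_sentence(text, bank, rng):
    """Concatenate one stored segment per word of a target-domain sentence.

    Returns None if any word has no segment in the bank (an assumption;
    the source does not say how uncovered words are handled).
    """
    spliced = []
    for word in text.split():
        candidates = bank.get(word)
        if not candidates:
            return None
        spliced.extend(rng.choice(candidates))
    return spliced


# Toy source-domain data with word-level alignments.
source = [
    ([0.1, 0.2, 0.3, 0.4, 0.5], [("play", 0, 2), ("music", 2, 5)]),
    ([0.6, 0.7, 0.8, 0.9], [("stop", 0, 1), ("the", 1, 2), ("music", 2, 4)]),
]
bank = build_segment_bank(source)
rng = random.Random(0)

# Target-domain text only: no new audio is collected.
audio = splice_sentence("stop music", bank, rng)
missing = splice_sentence("pause music", bank, rng)
```

In this sketch, `audio` paired with the text "stop music" would serve as a synthetic adaptation example, while `missing` is None because "pause" never occurs in the toy source data.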
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
