Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers
Juntae Kim, Jeehye Lee

TL;DR
This paper improves out-domain speech recognition by introducing sparse self-attention layers in Conformer RNN-T models, together with a state reset method, which significantly reduces errors on long-form utterances.
Contribution
It proposes sparse self-attention layers and a state reset technique to enhance Conformer RNN-T's out-domain generalization, addressing domain mismatch issues.
Findings
27.6% relative CER reduction on out-domain test data
Sparse self-attention effectively mitigates long-form utterance errors
Enhanced out-domain robustness over fully connected self-attention models
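The headline finding is a relative, not absolute, reduction in character error rate (CER). The sketch below shows how such a figure is conventionally computed. The function names and the baseline/improved CER values are hypothetical, chosen only so the arithmetic lands on 27.6%; they are not numbers reported in the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical CER values (in percent), for illustration only.
baseline_cer = 10.0
improved_cer = 7.24
reduction = relative_reduction(baseline_cer, improved_cer)
print(f"{reduction:.1%}")
```

A hypothesis that drops characters, as in the deletion errors described in the abstract, raises the edit distance by one per missing character, so `cer("abcd", "abd")` is 0.25.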
Abstract
Recurrent neural network transducer (RNN-T) is an end-to-end speech recognition framework converting input acoustic frames into a character sequence. The state-of-the-art encoder network for RNN-T is the Conformer, which can effectively model the local-global context information via its convolution and self-attention layers. Although Conformer RNN-T has shown outstanding performance, most studies have been verified in the setting where the train and test data are drawn from the same domain. The domain mismatch problem for Conformer RNN-T has not been intensively investigated yet, which is an important issue for the product-level speech recognition system. In this study, we identified that fully connected self-attention layers in the Conformer caused high deletion errors, specifically in the long-form out-domain utterances. To address this problem, we introduce sparse self-attention…
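The abstract is cut off before it describes the sparsity pattern, so the following is only a sketch of one common form of sparse self-attention: a local band in which each frame attends to neighbours within a fixed window. The function name, the `window` parameter, and the banded pattern itself are assumptions for illustration and should not be read as the authors' design.

```python
import math


def local_attention_weights(scores, window):
    """Row-wise softmax restricted to keys within `window` frames of each query.

    `scores` is a square list-of-lists of raw attention scores (query x key).
    Keys outside the band receive a weight of exactly 0.0.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        band = scores[i][lo:hi]
        row_max = max(band)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in band]
        total = sum(exps)
        row = [0.0] * n
        for offset, e in enumerate(exps):
            row[lo + offset] = e / total
        weights.append(row)
    return weights


# Six frames with uniform scores: each row spreads its weight over the band only.
scores = [[0.0] * 6 for _ in range(6)]
weights = local_attention_weights(scores, window=1)
```

With fully connected attention, every row would spread weight over all frames in the utterance; with the band, the number of attended frames per query stays fixed regardless of utterance length.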
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Pruning · Convolution
