Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers

Juntae Kim; Jeehye Lee

arXiv:2108.10752·eess.AS·June 20, 2022·1 cites

Open Access

TL;DR

This paper improves out-domain speech recognition by introducing sparse self-attention layers and a state reset method in Conformer RNN-T models, which reduces the deletion errors that fully connected self-attention produces on long-form utterances.

Contribution

It proposes sparse self-attention layers and a state reset technique to enhance Conformer RNN-T's out-domain generalization, addressing domain mismatch issues.

Findings

01

27.6% relative reduction in character error rate (CER) on out-domain test data

02

Sparse self-attention effectively mitigates long-form utterance errors

03

Enhanced out-domain robustness over fully connected self-attention models
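
A relative reduction compares the new error rate against the baseline rather than reporting an absolute difference. The short sketch below shows the arithmetic; the function name `relative_reduction` and the two CER values are hypothetical and chosen only so the result matches the reported 27.6%. They are not the paper's numbers.

```python
def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Return the relative error reduction in percent.

    Computed as (baseline - new) / baseline * 100.
    """
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - new_cer) / baseline_cer * 100.0


# Hypothetical CER values (in percent), for illustration only.
example = relative_reduction(baseline_cer=10.0, new_cer=7.24)
print(f"{example:.1f}% relative reduction")
```

With these made-up values, the absolute drop is 2.76 points, which is 27.6% of the 10.0-point baseline.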

Abstract

Recurrent neural network transducer (RNN-T) is an end-to-end speech recognition framework converting input acoustic frames into a character sequence. The state-of-the-art encoder network for RNN-T is the Conformer, which can effectively model the local-global context information via its convolution and self-attention layers. Although Conformer RNN-T has shown outstanding performance, most studies have been verified in the setting where the train and test data are drawn from the same domain. The domain mismatch problem for Conformer RNN-T has not been intensively investigated yet, which is an important issue for the product-level speech recognition system. In this study, we identified that fully connected self-attention layers in the Conformer caused high deletion errors, specifically in the long-form out-domain utterances. To address this problem, we introduce sparse self-attention…
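
The abstract is cut off before it describes the sparsity pattern, so the following is only a minimal sketch of one common form of sparse self-attention: a local band in which each frame attends to neighbors within a fixed window. The window size, the helper names `band_mask` and `masked_attention`, and the toy inputs are all assumptions made for illustration, and the pattern used in the paper may differ.

```python
import math


def band_mask(n: int, window: int) -> list[list[bool]]:
    """True where frame i may attend to frame j, i.e. |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to positions allowed by mask."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Score only the allowed positions; masked positions stay None.
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) if mask[i][j] else None
            for j, kj in enumerate(k)
        ]
        # Softmax over allowed positions, shifted by the max for stability.
        top = max(s for s in scores if s is not None)
        weights = [math.exp(s - top) if s is not None else 0.0 for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


# Toy example: 5 frames with 2-dimensional features (assumed values).
frames = [[float(t), 1.0] for t in range(5)]
# One-hot values make each output row equal to its attention weights.
values = [[1.0 if r == c else 0.0 for c in range(5)] for r in range(5)]
mask = band_mask(5, window=1)
attended = masked_attention(frames, frames, values, mask)
```

With a fully connected mask, every frame would receive weight from all other frames, so the cost and the attention span grow with utterance length. The band keeps the span fixed regardless of length, which is the kind of property that matters for long-form audio.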

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Pruning · Convolution