LongFNT: Long-form Speech Recognition with Factorized Neural Transducer
Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian

TL;DR
This paper introduces LongFNT, a novel long-form speech recognition model that leverages a factorized neural transducer with integrated language modeling and contextual encoding, achieving significant WER reductions on LibriSpeech and GigaSpeech.
Contribution
The paper proposes the LongFNT architecture, which fuses long-form features into a factorized neural transducer and uses a pre-trained contextual encoder (RoBERTa) to improve long-form speech recognition.
Findings
Achieves 19% relative WER reduction on LibriSpeech
Achieves 12% relative WER reduction on GigaSpeech
Demonstrates effectiveness of long-form feature integration
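Reading these percentages as relative reductions, the usual convention when reporting WER, the arithmetic can be sketched as below. The function name and the baseline and improved WER values are hypothetical and chosen only to illustrate the formula; they are not figures from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only; they are not taken from the paper.
baseline = 10.0   # baseline WER in percent
improved = 8.1    # WER in percent after adding long-form context
reduction = relative_wer_reduction(baseline, improved)
print(f"{reduction:.0%}")  # prints "19%"
```

Under this reading, a 19% reduction means the new WER is 81% of the baseline WER, not that the WER dropped by 19 absolute points.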
Abstract
Traditional automatic speech recognition (ASR) systems usually focus on individual utterances, without considering long-form speech with useful historical information, which is more practical in real scenarios. Simply attending to a longer transcription history with a vanilla neural transducer model yields little gain in our preliminary experiments, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, which contains a real language model, the vocabulary predictor. We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, with a pre-trained contextual encoder RoBERTa to further boost the performance. Moreover, we propose the LongFNT architecture by…
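The sentence-level fusion described in the abstract can be pictured with a toy sketch. The abstract only states that sentence-level long-form features are fused "directly with the output of the vocabulary predictor"; the additive fusion, the single output projection, and every name and number below are assumptions made for illustration, not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def linear(weights, x):
    """Apply a bias-free linear layer; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def fused_vocab_log_probs(predictor_state, context_vec, w_out):
    """Add a sentence-level context vector to the vocabulary predictor
    output (assumed fusion rule), project to the vocabulary, and normalise."""
    fused = [p + c for p, c in zip(predictor_state, context_vec)]
    return log_softmax(linear(w_out, fused))


# Hypothetical toy values: hidden size 2, vocabulary size 3.
predictor_state = [0.5, -0.2]   # vocabulary predictor output for the current token
context_vec = [0.1, 0.3]        # sentence-level feature from previous utterances
w_out = [[0.2, -0.1], [0.0, 0.4], [-0.3, 0.1]]

with_ctx = fused_vocab_log_probs(predictor_state, context_vec, w_out)
no_ctx = fused_vocab_log_probs(predictor_state, [0.0, 0.0], w_out)
```

Comparing `with_ctx` and `no_ctx` shows how a history vector shifts the language-model-like distribution over the vocabulary, which is the effect the factorized structure is meant to make possible.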
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Weight Decay · Dropout · Attention Dropout · Residual Connection · Linear Warmup With Linear Decay · Linear Layer · Adam · WordPiece
