LongFNT: Long-form Speech Recognition with Factorized Neural Transducer

Xun Gong; Yu Wu; Jinyu Li; Shujie Liu; Rui Zhao; Xie Chen; Yanmin Qian

arXiv:2211.09412 · cs.SD · November 18, 2022

TL;DR

This paper introduces LongFNT, a long-form speech recognition model built on a factorized neural transducer whose vocabulary predictor acts as a language model and is augmented with long-form context from a pre-trained contextual encoder. It achieves relative WER reductions of 19% on LibriSpeech and 12% on GigaSpeech.

Contribution

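It proposes the LongFNT-Text architecture, which fuses sentence-level and token-level long-form features into the vocabulary predictor of a factorized neural transducer and uses a pre-trained contextual encoder (RoBERTa) to improve long-form speech recognition, and it extends this design to the LongFNT architecture.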

Findings

01 Achieves a 19% relative WER reduction on LibriSpeech

02 Achieves a 12% relative WER reduction on GigaSpeech

03 Demonstrates the effectiveness of long-form feature integration
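
WER reductions in speech recognition are usually quoted relative to a baseline system rather than as absolute percentage points. The sketch below shows one common way to compute word error rate with a word-level edit distance and then the relative reduction between two systems. The function names and all numbers are illustrative assumptions, not values or code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Hypothetical systems: a baseline at 10% WER and a new model at 8% WER
# correspond to a relative reduction of about 20%.
gain = relative_reduction(0.10, 0.08)
```

Under this convention, a 19% relative reduction means the new system removes roughly a fifth of the baseline's errors, whatever the baseline's absolute WER happens to be.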

Abstract

Traditional automatic speech recognition (ASR) systems usually focus on individual utterances, without considering long-form speech with useful historical information, which is more practical in real scenarios. Simply attending to a longer transcription history with a vanilla neural transducer model yields little gain in our preliminary experiments, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, which contains a real language model, the vocabulary predictor. We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, with a pre-trained contextual encoder, RoBERTa, to further boost the performance. Moreover, we propose the LongFNT architecture by…
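
To make the sentence-level fusion idea concrete, here is a minimal sketch of one output step of a factorized transducer in which a context-derived bias is added to the vocabulary predictor's logits. Everything here is an assumption for illustration: the additive fusion, the function names (`fuse_context`, `fnt_output`), and the toy numbers are hypothetical, and the paper's actual fusion layers may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def fuse_context(vocab_logits, context_bias):
    """Assumed additive fusion: shift the vocabulary predictor's logits
    by a bias derived from sentence-level history features."""
    return [v + c for v, c in zip(vocab_logits, context_bias)]


def fnt_output(enc_vocab, blank_logit, vocab_logits, context_bias):
    """One output step of a simplified factorized transducer.

    The vocabulary predictor behaves like a language model, so its
    (context-fused) logits are normalized into log-probabilities before
    being combined with the acoustic scores. The blank score is kept
    separate and concatenated at index 0.
    """
    lm_log_probs = log_softmax(fuse_context(vocab_logits, context_bias))
    vocab_scores = [e + l for e, l in zip(enc_vocab, lm_log_probs)]
    return log_softmax([blank_logit] + vocab_scores)


# Toy three-token vocabulary with hypothetical scores.
enc_vocab = [0.5, -0.2, 0.1]      # acoustic scores per token
vocab_logits = [1.0, 0.0, -1.0]   # vocabulary predictor logits
no_bias = [0.0, 0.0, 0.0]
history_bias = [0.0, 2.0, 0.0]    # history favors token 1

base = fnt_output(enc_vocab, 0.3, vocab_logits, no_bias)
boosted = fnt_output(enc_vocab, 0.3, vocab_logits, history_bias)
```

In this toy setup, a history that favors token 1 raises that token's final log-probability relative to the run without context. The acoustic scores themselves stay unchanged, because the factorization keeps the language-model path separate from the encoder path.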

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Weight Decay · Dropout · Attention Dropout · Residual Connection · Linear Warmup With Linear Decay · Linear Layer · Adam · WordPiece