t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability
Jian Wu, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao, Zhuo Chen, Jinyu Li

TL;DR
This paper introduces t-SOT FNT, a streaming multi-talker ASR model that separates the language model from the transducer's predictor, enabling text-only domain adaptation while keeping recognition accuracy comparable to the original t-SOT model.
Contribution
It proposes a factorized neural transducer (FNT) architecture for t-SOT that makes text-only adaptation possible in multi-talker speech recognition.
Findings
Achieves comparable WER to original t-SOT model
Enables effective text-only domain adaptation
Reduces word error rate on multi-talker datasets through text-only adaptation
Abstract
Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). T-SOT effectively handles overlapped speech by representing multi-talker transcriptions as a single token stream with $\langle \text{cc}\rangle$ symbols interspersed. However, the use of a naive neural transducer architecture significantly constrained its applicability for text-only adaptation. To overcome this limitation, we propose a novel t-SOT model structure that incorporates the idea of factorized neural transducers (FNT). The proposed method separates a language model (LM) from the transducer's predictor and handles the unnatural token order resulting from the use of $\langle \text{cc}\rangle$ symbols in t-SOT. We achieve this by maintaining multiple hidden states and introducing special handling of the $\langle…
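The single-token-stream representation described in the abstract can be illustrated with a small sketch. It assumes details the text above does not spell out: that the interspersed symbol is a channel-change token (written `<cc>` here), that there are two virtual channels, and that tokens are ordered by their end times. The function names `serialize_tsot` and `deserialize_tsot`, and the timestamps, are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from typing import List, Tuple

CC = "<cc>"  # assumed channel-change symbol

def serialize_tsot(tokens: List[Tuple[float, int, str]]) -> List[str]:
    """Merge (end_time, channel, text) tokens into one stream.

    Tokens are sorted by end time; a CC symbol is emitted whenever
    the channel differs from the channel of the previous token.
    Assumes two virtual channels, 0 and 1, starting on channel 0.
    """
    stream: List[str] = []
    current = 0
    for _, channel, text in sorted(tokens, key=lambda t: t[0]):
        if channel != current:
            stream.append(CC)
            current = channel
        stream.append(text)
    return stream

def deserialize_tsot(stream: List[str]) -> List[List[str]]:
    """Split a serialized stream back into two per-channel token lists."""
    channels: List[List[str]] = [[], []]
    current = 0
    for tok in stream:
        if tok == CC:
            current = 1 - current  # toggle the active channel
        else:
            channels[current].append(tok)
    return channels

# Two overlapping utterances: "hello how are you" and "hi there".
tokens = [
    (0.4, 0, "hello"),
    (0.9, 0, "how"),
    (1.1, 1, "hi"),
    (1.3, 0, "are"),
    (1.6, 1, "there"),
    (1.8, 0, "you"),
]
stream = serialize_tsot(tokens)
channels = deserialize_tsot(stream)
print(" ".join(stream))
print([" ".join(c) for c in channels])
```

The serialized stream interleaves words from both speakers, which is the unnatural token order the abstract refers to. Tracking one running state per channel, as `deserialize_tsot` does with its two lists, is loosely analogous to keeping multiple hidden states in the LM, though the model's actual mechanism is not shown here.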
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
