t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability

Jian Wu; Naoyuki Kanda; Takuya Yoshioka; Rui Zhao; Zhuo Chen; Jinyu Li

arXiv:2309.08131 · eess.AS · September 18, 2023

TL;DR

This paper introduces a novel t-SOT FNT model for streaming multi-talker ASR that effectively separates language modeling from the transducer, enabling text-only domain adaptation while maintaining high recognition accuracy.

Contribution

It proposes a factorized neural transducer architecture for t-SOT, improving text-only adaptation capabilities in multi-talker speech recognition.

Findings

01 Achieves comparable WER to original t-SOT model

02 Enables effective text-only domain adaptation

03 Reduces word error rate on multi-talker datasets

Abstract

Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). t-SOT handles overlapped speech by representing multi-talker transcriptions as a single token stream with $\langle \text{cc} \rangle$ symbols interspersed. However, its use of a naive neural transducer architecture significantly constrains its applicability for text-only adaptation. To overcome this limitation, we propose a novel t-SOT model structure that incorporates the idea of factorized neural transducers (FNT). The proposed method separates a language model (LM) from the transducer's predictor and handles the unnatural token order resulting from the use of $\langle \text{cc} \rangle$ symbols in t-SOT. We achieve this by maintaining multiple hidden states and introducing special handling of the $\langle…
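To make the single-token-stream idea concrete, the following is a minimal sketch of how two overlapping speakers' words might be serialized with a channel-change symbol and then split back into per-channel transcripts. It is an illustration only, not the paper's implementation: the function names `serialize` and `deserialize`, the `(end_time, channel, word)` tuple layout, the literal string `"<cc>"`, and the two-channel limit are all assumptions made for this example.

```python
from typing import List, Tuple

CC = "<cc>"  # assumed spelling of the channel-change symbol for this sketch


def serialize(tokens: List[Tuple[float, int, str]]) -> List[str]:
    """Merge (end_time, channel, word) tuples into one stream.

    Words are ordered by end time; CC is inserted whenever two
    adjacent words come from different channels.
    """
    stream: List[str] = []
    prev_channel = None
    for _, channel, word in sorted(tokens, key=lambda t: t[0]):
        if prev_channel is not None and channel != prev_channel:
            stream.append(CC)
        stream.append(word)
        prev_channel = channel
    return stream


def deserialize(stream: List[str], num_channels: int = 2) -> List[List[str]]:
    """Split a serialized stream back into per-channel word lists.

    Decoding starts on channel 0 and moves to the next channel
    (cyclically) each time CC is seen.
    """
    channels: List[List[str]] = [[] for _ in range(num_channels)]
    current = 0
    for tok in stream:
        if tok == CC:
            current = (current + 1) % num_channels
        else:
            channels[current].append(tok)
    return channels


# Hypothetical overlapping utterances: channel 0 and channel 1.
example = [
    (0.5, 0, "hello"), (1.0, 0, "how"), (1.4, 0, "are"), (1.8, 0, "you"),
    (1.2, 1, "i"), (1.6, 1, "am"), (2.0, 1, "fine"),
]
stream = serialize(example)
print(" ".join(stream))
print(deserialize(stream))
```

In the serialized stream the words of the two speakers interleave, so the sequence no longer reads as natural text. Each per-channel list recovered by `deserialize` does read naturally, which gives some intuition for why a language model might need separate state per channel rather than a single state over the mixed stream.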

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing