Streaming Multi-Talker ASR with Token-Level Serialized Output Training

Naoyuki Kanda; Jian Wu; Yu Wu; Xiong Xiao; Zhong Meng; Xiaofei Wang; Yashesh Gaur; Zhuo Chen; Jinyu Li; Takuya Yoshioka

arXiv:2202.00842·eess.AS·July 18, 2022

TL;DR

This paper introduces t-SOT, a streaming multi-talker ASR framework whose single output stream recognizes speech from multiple, possibly overlapping speakers in real time, with lower word error rates and a simpler architecture than prior streaming multi-talker models.

Contribution

The paper presents a novel token-level serialized output training framework for streaming multi-talker ASR with a single output branch, reducing complexity and inference cost.

Findings

01  Achieves state-of-the-art word error rates on LibriSpeechMix and LibriCSS datasets.

02  Performs comparably to single-talker ASR on non-overlapping speech.

03  Simplifies model architecture while maintaining high accuracy.

Abstract

This paper proposes a token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models using multiple output branches, the t-SOT model has only a single output branch that generates recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates the change of "virtual" output channels is introduced to keep track of the overlapping utterances. Compared to the prior streaming multi-talker ASR models, the t-SOT model has the advantages of less inference cost and a simpler model architecture. Moreover, in our experiments with LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves the state-of-the-art word error rates by a significant margin to the prior…
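The serialization idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the token spelling `<cc>`, the limit of two virtual channels, the default starting channel 0, and the helper names `serialize` and `deserialize` are all assumptions made for illustration. Tokens from overlapping utterances are sorted by emission time, and a channel-change token is inserted whenever consecutive tokens belong to different virtual channels; reading the stream back only requires toggling the channel at each such token.

```python
from typing import List, Tuple

CC = "<cc>"  # assumed spelling of the channel-change token

# (emission time in seconds, virtual channel 0 or 1, token text)
Token = Tuple[float, int, str]


def serialize(tokens: List[Token]) -> List[str]:
    """Merge tokens from two virtual channels into one chronological stream."""
    stream: List[str] = []
    current = 0  # assumption: the stream starts on channel 0
    for _, channel, text in sorted(tokens, key=lambda t: t[0]):
        if channel != current:
            stream.append(CC)  # mark the switch to the other channel
            current = channel
        stream.append(text)
    return stream


def deserialize(stream: List[str]) -> List[List[str]]:
    """Split a serialized stream back into per-channel token lists."""
    channels: List[List[str]] = [[], []]
    current = 0
    for tok in stream:
        if tok == CC:
            current = 1 - current  # toggle between the two channels
        else:
            channels[current].append(tok)
    return channels


# Hypothetical example: two utterances that overlap in time.
example: List[Token] = [
    (0.2, 0, "hello"),
    (0.6, 0, "how"),
    (0.7, 1, "good"),
    (0.9, 0, "are"),
    (1.0, 1, "morning"),
    (1.2, 0, "you"),
]

stream = serialize(example)
print(stream)
print(deserialize(stream))
```

In this sketch the single stream carries both utterances, which is why one output branch is enough; the per-speaker transcripts are recovered purely in post-processing.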

Code & Models

Repositories

mu-y/diarist
pytorch
