TL;DR
This paper introduces t-SOT, a streaming multi-talker ASR framework with a single output stream that recognizes overlapping speech from multiple speakers in real time, with lower word error rates and a simpler architecture than prior streaming multi-talker models.
Contribution
The paper presents a novel token-level serialized output training framework for streaming multi-talker ASR with a single output branch, reducing complexity and inference cost.
Findings
Achieves state-of-the-art word error rates on the LibriSpeechMix and LibriCSS datasets.
Performs comparably to single-talker ASR on non-overlapping speech.
Simplifies model architecture while maintaining high accuracy.
Abstract
This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output branches, the t-SOT model has only a single output branch that generates recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates the change of "virtual" output channels is introduced to keep track of the overlapping utterances. Compared to the prior streaming multi-talker ASR models, the t-SOT model has the advantages of lower inference cost and a simpler model architecture. Moreover, in our experiments with LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves the state-of-the-art word error rates by a significant margin to the prior…
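The serialization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the token spelling `<cc>`, the `(text, emission_time, channel)` tuple layout, the function names, and the restriction to two virtual channels with the first emitted token on channel 0 are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical spelling of the special channel-change token.
CC = "<cc>"

# (token text, emission time in seconds, virtual channel index 0 or 1)
Token = Tuple[str, float, int]


def serialize(tokens: List[Token]) -> List[str]:
    """Sort tokens by emission time and insert CC at every channel switch."""
    stream: List[str] = []
    current = None
    for text, _time, channel in sorted(tokens, key=lambda t: t[1]):
        if current is not None and channel != current:
            stream.append(CC)
        stream.append(text)
        current = channel
    return stream


def deserialize(stream: List[str]) -> Tuple[List[str], List[str]]:
    """Split a single token stream back into two virtual channels.

    Assumes the stream starts on channel 0 and that each CC toggles
    between exactly two channels.
    """
    channels: Tuple[List[str], List[str]] = ([], [])
    current = 0
    for tok in stream:
        if tok == CC:
            current = 1 - current
        else:
            channels[current].append(tok)
    return channels


# Two overlapping utterances: "hello how are you" and "hi there".
example: List[Token] = [
    ("hello", 0.1, 0), ("how", 0.5, 0), ("are", 0.9, 0), ("you", 1.2, 0),
    ("hi", 0.4, 1), ("there", 0.8, 1),
]
stream = serialize(example)
recovered = deserialize(stream)
print(" ".join(stream))
print(recovered)
```

Because the channel switches are encoded in the token sequence itself, a model with a single output branch can be trained on `stream` directly, and the per-channel transcripts are recovered afterwards with a simple pass over the output.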
