Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings
Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

TL;DR
This paper introduces a streaming SA-ASR model that jointly transcribes multi-talker speech and recognizes speaker identities with low latency, using token-level speaker embeddings called t-vectors.
Contribution
It proposes a novel encoder-decoder speaker embedding extractor integrated with token-level serialized output training for joint transcription and speaker recognition.
Findings
The proposed model achieves better accuracy than previous streaming models.
It performs comparably to, or better than, offline SA-ASR models.
It recognizes speakers effectively even in overlapping speech.
Abstract
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what" with low latency even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT), which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities, we propose an encoder-decoder based speaker embedding extractor that can estimate a speaker representation for each recognized token not only from non-overlapping speech but also from overlapping speech. The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription with low latency. We evaluate the proposed model for a joint task of ASR…
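The serialization idea behind t-SOT, as described in the earlier t-SOT work the abstract refers to, can be sketched as follows: tokens from overlapping speakers are ordered by emission time into a single stream, and a special channel-change token marks each switch between virtual output channels. This is a minimal illustration only, not the authors' implementation. The names `Token`, `serialize_tsot`, `deserialize_tsot`, and the `<cc>` spelling are hypothetical, and the sketch assumes at most two simultaneous speakers with the first token on channel 0. The t-vector extraction itself is not shown.

```python
from dataclasses import dataclass

CC = "<cc>"  # hypothetical spelling of the channel-change token


@dataclass
class Token:
    text: str
    end_time: float  # emission time in seconds (illustrative values)
    channel: int     # virtual output channel, 0 or 1


def serialize_tsot(tokens):
    """Merge tokens from all channels into one time-ordered stream.

    A channel-change token is inserted whenever two adjacent tokens
    belong to different virtual channels.
    """
    ordered = sorted(tokens, key=lambda t: t.end_time)
    stream = []
    prev_channel = None
    for tok in ordered:
        if prev_channel is not None and tok.channel != prev_channel:
            stream.append(CC)
        stream.append(tok.text)
        prev_channel = tok.channel
    return stream


def deserialize_tsot(stream, num_channels=2):
    """Split a serialized stream back into per-channel token lists.

    Assumes the first token belongs to channel 0.
    """
    channels = [[] for _ in range(num_channels)]
    current = 0
    for item in stream:
        if item == CC:
            current = (current + 1) % num_channels
        else:
            channels[current].append(item)
    return channels


# Two overlapping utterances: "hello world" and "how are you".
tokens = [
    Token("hello", 0.5, 0),
    Token("world", 1.0, 0),
    Token("how", 0.9, 1),
    Token("are", 1.3, 1),
    Token("you", 1.6, 1),
]
stream = serialize_tsot(tokens)
channels = deserialize_tsot(stream)
```

Because the stream is ordered by time, a streaming decoder can emit it token by token, and a per-token speaker embedding such as the t-vector can be attached to each non-`<cc>` item as it is produced.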
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
