Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

Naoyuki Kanda; Jian Wu; Yu Wu; Xiong Xiao; Zhong Meng; Xiaofei Wang; Yashesh Gaur; Zhuo Chen; Jinyu Li; Takuya Yoshioka

arXiv:2203.16685 · eess.AS · July 18, 2022

TL;DR

This paper introduces a streaming SA-ASR model that jointly transcribes multi-talker speech and recognizes speaker identities with low latency, using token-level speaker embeddings called t-vectors.

Contribution

It proposes a novel encoder-decoder speaker embedding extractor integrated with token-level serialized output training for joint transcription and speaker recognition.

Findings

01. Achieves better accuracy than previous streaming models.

02. Performs comparably to, or better than, offline SA-ASR models.

03. Recognizes speakers effectively even in overlapping speech.

Abstract

This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what" with low latency even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT), which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities, we propose an encoder-decoder based speaker embedding extractor that can estimate a speaker representation for each recognized token, not only from non-overlapping speech but also from overlapping speech. The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription with low latency. We evaluate the proposed model for a joint task of ASR…
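To make the idea of a per-token speaker embedding concrete, the following is a minimal sketch of how token-level embeddings could drive speaker identification. It is an illustration under stated assumptions, not the paper's implementation: the function names (`cosine`, `attribute_tokens`), the cosine-similarity scoring, the two-dimensional toy vectors, and the enrolled profiles are all hypothetical. It assumes each recognized token arrives with one embedding and that an averaged profile embedding already exists for every enrolled speaker.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute_tokens(tokens, token_embeddings, profiles):
    """Pair each token with the enrolled speaker whose profile is most similar.

    tokens:           list of recognized tokens, in emission order
    token_embeddings: one embedding per token (hypothetical stand-in for t-vectors)
    profiles:         dict mapping speaker name -> profile embedding (assumed given)
    """
    attributed = []
    for token, embedding in zip(tokens, token_embeddings):
        best = max(profiles, key=lambda name: cosine(embedding, profiles[name]))
        attributed.append((token, best))
    return attributed


# Toy data: two enrolled speakers and three tokens from interleaved speech.
profiles = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
tokens = ["hello", "hi", "there"]
token_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]

attributed = attribute_tokens(tokens, token_embeddings, profiles)
for token, speaker in attributed:
    print(f"{speaker}: {token}")
```

Because every token carries its own embedding, the speaker label can change from one token to the next, which is what allows attribution inside overlapping regions. For diarization without enrolled profiles, the same embeddings would presumably be clustered rather than matched, but that step is not sketched here.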

Code & Models

Repositories

mu-y/diarist (PyTorch)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing