DiariST: Streaming Speech Translation with Speaker Diarization

Mu Yang; Naoyuki Kanda; Xiaofei Wang; Junkun Chen; Peidong Wang; Jian Xue; Jinyu Li; Takuya Yoshioka

arXiv:2309.08007 · eess.AS · January 24, 2024 · 1 citation

TL;DR

DiariST is a streaming system that performs speech translation and speaker diarization jointly and handles overlapping speech. The paper also contributes a new evaluation dataset and new metrics for the task.

Contribution

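The paper introduces what the authors describe as the first streaming solution for speech translation with speaker diarization, together with a new evaluation dataset (DiariST-AliMeeting) and two new metrics (speaker-agnostic BLEU and speaker-attributed BLEU) for evaluating such systems.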

Findings

1. Achieves strong streaming speech translation and diarization performance.
2. Handles overlapping speech effectively in streaming mode.
3. Provides new benchmarks and tools for future research.

Abstract

End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word time stamps and handling of overlapping speech in a streaming fashion. In this work, we propose DiariST, the first streaming ST and SD solution. It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector, which were originally developed for multi-talker speech recognition. Due to the absence of evaluation benchmarks in this area, we develop a new evaluation dataset, DiariST-AliMeeting, by translating the reference Chinese transcriptions of the AliMeeting corpus into English. We also propose new metrics, called speaker-agnostic BLEU and speaker-attributed BLEU, to measure the ST quality while taking SD accuracy into account. Our system achieves a strong ST and…
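
The abstract refers to token-level serialized output training, a technique from multi-talker speech recognition in which tokens from overlapping speakers are merged into one stream and a special symbol marks each switch between virtual output channels. The following is a minimal sketch of that serialization idea, not the authors' implementation. The symbol name `<cc>`, the function names, the two-channel limit, and the per-token timestamps are all assumptions made for illustration; this excerpt does not say how DiariST orders translated tokens.

```python
from typing import List, Optional, Tuple

CC = "<cc>"  # channel-change symbol; the name is an assumption for this sketch

# (emission time in seconds, virtual channel index, token) -- hypothetical layout
Token = Tuple[float, int, str]


def serialize_tsot(tokens: List[Token]) -> List[str]:
    """Merge tokens from all channels into one time-ordered stream,
    inserting CC whenever the channel differs from the previous token's."""
    out: List[str] = []
    prev_channel: Optional[int] = None
    for _, channel, tok in sorted(tokens, key=lambda t: t[0]):
        if prev_channel is not None and channel != prev_channel:
            out.append(CC)
        out.append(tok)
        prev_channel = channel
    return out


def deserialize_tsot(stream: List[str]) -> List[List[str]]:
    """Split a serialized stream back into two per-channel token lists.
    Assumes exactly two channels and that the first token is on channel 0,
    so each CC simply toggles the active channel."""
    channels: List[List[str]] = [[], []]
    active = 0
    for tok in stream:
        if tok == CC:
            active = 1 - active
        else:
            channels[active].append(tok)
    return channels


# Two overlapping utterances with made-up timestamps.
example = [
    (0.0, 0, "hello"), (0.4, 0, "how"), (0.7, 0, "are"), (1.0, 0, "you"),
    (0.5, 1, "I"), (0.9, 1, "am"), (1.2, 1, "fine"),
]
stream = serialize_tsot(example)
recovered = deserialize_tsot(stream)
```

In this toy example the second speaker starts while the first is still talking, so the serialized stream alternates between channels and `deserialize_tsot` recovers the two utterances separately. Assigning speaker identities to the recovered channels is a separate step that this sketch does not cover.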

Code & Models

Repositories

mu-y/diarist (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing