One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition

Samuele Cornell; Jee-weon Jung; Shinji Watanabe; Stefano Squartini

arXiv:2310.01688 · eess.AS · October 4, 2023 · 2 cites

TL;DR

This paper introduces SLIDAR, an end-to-end framework that jointly performs speaker diarization and speech recognition on arbitrary-length audio, identifying who spoke what and when in both close-talk and far-field recordings.

Contribution

The paper proposes a sliding-window approach built around an end-to-end model that jointly handles diarization and automatic speech recognition (ASR), so it can process inputs of arbitrary length with any number of speakers.
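
As a rough illustration of the sliding-window idea, the sketch below splits a recording of any duration into overlapping windows that a fixed-context model could process one at a time. The function name `sliding_windows` and the 30-second window and 15-second hop are assumptions chosen for the example; this listing does not state the values used in the paper.

```python
def sliding_windows(total_sec, window_sec=30.0, hop_sec=15.0):
    """Return (start, end) times in seconds that cover [0, total_sec].

    window_sec and hop_sec are illustrative defaults, not values
    taken from the paper.
    """
    total_sec = float(total_sec)
    if total_sec <= 0 or window_sec <= 0 or hop_sec <= 0:
        raise ValueError("durations must be positive")
    if hop_sec > window_sec:
        raise ValueError("hop_sec must not exceed window_sec, or audio is skipped")

    windows = []
    start = 0.0
    while True:
        # Clip the last window so it never runs past the end of the audio.
        end = min(start + window_sec, total_sec)
        windows.append((start, end))
        if end >= total_sec:
            break
        start += hop_sec
    return windows


# A 70-second recording yields four overlapping windows.
example_windows = sliding_windows(70, window_sec=30.0, hop_sec=15.0)
```

Because each window is processed independently, the number of windows grows with the recording length while the per-window cost stays fixed, which is what lets a fixed-context model handle arbitrary-length input.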

Findings

01. Effective in both close-talk and far-field scenarios
02. Outperforms separate SD and ASR pipelines in experiments
03. Handles arbitrary-length inputs with a unified model

Abstract

This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary-length inputs and can handle any number of speakers, effectively solving "who spoke what, when" concurrently. SLIDAR leverages a sliding-window approach and consists of an end-to-end diarization-augmented speech transcription (E2E DAST) model which provides, locally, for each window: transcripts, diarization and speaker embeddings. The E2E DAST model is based on an encoder-decoder architecture and leverages recent techniques such as serialized output training and "Whisper-style" prompting. The local outputs are then combined to get the final SD+ASR result by clustering the speaker embeddings to get global speaker identities. Experiments performed on monaural recordings from the…
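
The final step described in the abstract, clustering per-window speaker embeddings to obtain global speaker identities, can be sketched as follows. The abstract does not say which clustering algorithm is used, so this example swaps in a simple greedy cosine-similarity clustering purely for illustration. The segment dictionary layout, the function name `assign_global_speakers`, and the 0.7 threshold are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_global_speakers(local_segments, threshold=0.7):
    """Map each local (per-window) speaker to a global speaker index.

    Greedy clustering: a segment joins the most similar existing cluster
    if the similarity reaches `threshold`, otherwise it starts a new one.
    This is an illustrative stand-in, not the paper's clustering method.
    """
    cluster_sums = []  # running sum of embeddings per global speaker
    labels = []
    for seg in local_segments:
        emb = seg["embedding"]
        best_idx, best_sim = None, threshold
        for idx, vec_sum in enumerate(cluster_sums):
            # Cosine against the summed vector equals cosine against the mean.
            sim = cosine(emb, vec_sum)
            if sim >= best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is None:
            cluster_sums.append(list(emb))
            best_idx = len(cluster_sums) - 1
        else:
            cluster_sums[best_idx] = [s + e for s, e in zip(cluster_sums[best_idx], emb)]
        labels.append(best_idx)
    return labels


# Two windows, two local speakers each; local indices are swapped in window 1.
demo_segments = [
    {"window": 0, "local_speaker": 0, "embedding": [1.0, 0.0]},
    {"window": 0, "local_speaker": 1, "embedding": [0.0, 1.0]},
    {"window": 1, "local_speaker": 0, "embedding": [0.1, 0.9]},
    {"window": 1, "local_speaker": 1, "embedding": [0.9, 0.1]},
]
demo_labels = assign_global_speakers(demo_segments)  # → [0, 1, 1, 0]
```

The demo shows why a global step is needed: local speaker indices are only meaningful inside one window, so the same person can be local speaker 0 in one window and local speaker 1 in the next until the embeddings are compared across windows.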

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing