Towards Word-Level End-to-End Neural Speaker Diarization with Auxiliary Network

Yiling Huang; Weiran Wang; Guanlong Zhao; Hank Liao; Wei Xia; Quan Wang

arXiv:2309.08489·eess.AS·September 18, 2023

Open Access

TL;DR

This paper introduces WEEND, a multi-task neural network that performs simultaneous speech recognition and speaker diarization at the word level, streamlining the process and improving accuracy in multi-speaker scenarios.

Contribution

The paper presents an end-to-end neural architecture with an auxiliary network for joint speech recognition and speaker diarization, predicting a speaker label for each word as it is recognized.

Findings

01

Outperforms the turn-based diarization baseline on 2-speaker scenarios

02

Generalizes to 5-minute audio segments

03

Potential for high-quality diarization with sufficient training data

Abstract

While standard speaker diarization attempts to answer the question "who spoke when", most relevant applications in practice are more interested in determining "who spoke what". Whether one uses the conventional modularized approach or the more recent end-to-end neural diarization (EEND), an additional automatic speech recognition (ASR) model and an orchestration algorithm are required to associate the speaker labels with recognized words. In this paper, we propose Word-level End-to-End Neural Diarization (WEEND) with auxiliary network, a multi-task learning algorithm that performs end-to-end ASR and speaker diarization in the same neural architecture. That is, while speech is being recognized, speaker labels are predicted simultaneously for each recognized word. Experimental results demonstrate that WEEND outperforms the turn-based diarization baseline system on all 2-speaker…
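To make the "orchestration algorithm" mentioned in the abstract concrete, the following is a minimal sketch of the kind of post-hoc step a conventional pipeline needs: it takes word timestamps from a separate ASR model and speaker turns from a separate diarization model, then gives each word the speaker whose turn overlaps it the most. This is not the paper's method; it illustrates the extra alignment step that WEEND aims to avoid by predicting a speaker label together with each word. The names `Word`, `Turn`, `overlap`, and `assign_speakers` are hypothetical and chosen only for this illustration, and maximum-overlap matching is an assumed heuristic.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    """A recognized word with start/end times in seconds (hypothetical ASR output)."""
    text: str
    start: float
    end: float


@dataclass
class Turn:
    """A speaker turn with start/end times in seconds (hypothetical diarization output)."""
    speaker: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose turn overlaps it the most.

    Words that overlap no turn are labeled None.
    """
    labeled = []
    for w in words:
        best_speaker = None
        best_overlap = 0.0
        for t in turns:
            o = overlap(w.start, w.end, t.start, t.end)
            if o > best_overlap:
                best_overlap = o
                best_speaker = t.speaker
        labeled.append((w.text, best_speaker))
    return labeled


if __name__ == "__main__":
    words = [Word("hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("hi", 1.1, 1.3)]
    turns = [Turn("spk1", 0.0, 1.0), Turn("spk2", 1.0, 2.0)]
    for text, speaker in assign_speakers(words, turns):
        print(f"{speaker}: {text}")
```

A heuristic like this depends on accurate timestamps from both models and can mislabel words near speaker changes, which is one motivation for predicting word-level speaker labels inside the recognizer itself.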

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing