Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios

Juan Ignacio Alvarez-Trejos; Beltrán Labrador; Alicia Lozano-Diez

arXiv:2407.01317·cs.SD·July 2, 2024

Open Access

TL;DR

This paper enhances end-to-end neural speaker diarization for two-speaker scenarios by integrating speaker embeddings, leading to significant error rate reductions while preserving overlap handling capabilities.

Contribution

It introduces methods for incorporating speaker embeddings into end-to-end models and analyzes key factors like silence handling and embedding extraction parameters.

Findings

01 Achieved a 10.78% relative reduction in diarization error rate compared to the baseline.

02 Demonstrated improved speaker discrimination in two-speaker scenarios.

03 Validated effectiveness on the CallHome dataset for the two-speaker diarization task.
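
For reference, a relative reduction is conventionally measured against the baseline value rather than as a difference in absolute percentage points. The symbols below are notation introduced here for illustration and are not taken from the paper:

$$
\text{relative reduction} = \frac{\mathrm{DER}_{\text{baseline}} - \mathrm{DER}_{\text{proposed}}}{\mathrm{DER}_{\text{baseline}}} \times 100\%
$$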

Abstract

End-to-end neural speaker diarization systems address the speaker diarization task while effectively handling speech overlap. This work explores incorporating speaker information embeddings into end-to-end systems to enhance their speaker-discriminative capabilities while maintaining their overlap-handling strengths. To achieve this, we propose several methods for incorporating these embeddings alongside the acoustic features. Furthermore, we analyze the correct handling of silence frames, the window length for extracting speaker embeddings, and the transformer encoder size. The effectiveness of the proposed approach is evaluated on the CallHome dataset for the two-speaker diarization task, with results that demonstrate a significant reduction in diarization error rate, achieving a relative improvement of 10.78% compared to the baseline…
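
The abstract does not say which combination methods the authors propose, so the following is only a minimal sketch of one plausible option: frame-wise concatenation of window-level speaker embeddings with the acoustic features, with the embedding zeroed on silence frames. The function name, argument names, array shapes, and the choice to zero out silence are all assumptions made for illustration and are not the paper's implementation.

```python
import numpy as np


def append_speaker_embeddings(feats, window_embs, window_len, speech_mask):
    """Concatenate window-level speaker embeddings to frame-level features.

    Hypothetical helper, not the paper's method.
    feats:       (T, F) acoustic features, one row per frame
    window_embs: (W, D) one speaker embedding per extraction window
    window_len:  number of frames covered by each window
    speech_mask: (T,) boolean, True where the frame contains speech
    """
    num_frames = feats.shape[0]
    # Repeat each window embedding over the frames it covers, then trim to T.
    frame_embs = np.repeat(window_embs, window_len, axis=0)[:num_frames]
    if frame_embs.shape[0] < num_frames:
        raise ValueError("window embeddings do not cover all frames")
    # Assumed silence handling: zero the embedding on non-speech frames.
    frame_embs = frame_embs * speech_mask[:, None]
    # Result has shape (T, F + D) and could feed a transformer encoder.
    return np.concatenate([feats, frame_embs], axis=1)


feats = np.ones((5, 3))
window_embs = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
speech_mask = np.array([True, True, False, True, True])
combined = append_speaker_embeddings(feats, window_embs, 2, speech_mask)
```

In this toy call, `window_len` plays the role of the embedding extraction window length that the abstract lists as an analyzed factor; a longer window would give a smoother but less time-resolved embedding stream.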

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing