Improving End-to-End Neural Diarization Using Conversational Summary Representations
Samuel J. Broughton, Lahiru Samarakoon

TL;DR
This paper enhances end-to-end neural speaker diarization by replacing zero vector inputs with learned conversational summary representations, leading to improved diarization accuracy across multiple datasets.
Contribution
The study introduces learned conversational summary representations into EEND-EDA, improving speaker attractor generation and diarization performance.
Findings
Achieved 1.90% absolute DER improvement over baseline
Proposed three methods for initializing summary vectors
Investigated effects of varying input recording lengths
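For readers unfamiliar with the metric, the sketch below shows how diarization error rate (DER) is conventionally computed and what an "absolute" improvement means as opposed to a relative one. The function names and the durations in the example are illustrative assumptions, not values from the paper.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed speech + false alarm + speaker confusion) / total reference speech.

    All arguments are durations in the same unit (e.g. seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def absolute_improvement(baseline_der, new_der):
    """Difference in DER; multiply by 100 for percentage points."""
    return baseline_der - new_der


def relative_improvement(baseline_der, new_der):
    """Improvement expressed as a fraction of the baseline DER."""
    return (baseline_der - new_der) / baseline_der


# Made-up durations (seconds) purely for illustration, not results from the paper.
baseline = diarization_error_rate(4.0, 3.0, 5.0, 100.0)
improved = diarization_error_rate(4.0, 3.0, 3.0, 100.0)
abs_gain = absolute_improvement(baseline, improved)
rel_gain = relative_improvement(baseline, improved)
```

An absolute improvement is a difference in percentage points, so its significance depends on the baseline it is measured against. That is why the two quantities are kept separate here.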
Abstract
Speaker diarization is a task concerned with partitioning an audio recording by speaker identity. End-to-end neural diarization with encoder-decoder based attractor calculation (EEND-EDA) aims to solve this problem by directly outputting diarization results for a flexible number of speakers. Currently, the EDA module responsible for generating speaker-wise attractors is conditioned on zero vectors providing no relevant information to the network. In this work, we extend EEND-EDA by replacing the input zero vectors to the decoder with learned conversational summary representations. The updated EDA module sequentially generates speaker-wise attractors based on utterance-level information. We propose three methods to initialize the summary vector and conduct an investigation into varying input recording lengths. On a range of publicly available test sets, our model achieves an absolute DER…
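The change described in the abstract can be pictured with a toy encoder-decoder. This is a minimal sketch under loud assumptions, not the authors' implementation: a plain tanh recurrent cell stands in for the LSTM used in EEND-EDA, the weights are random rather than trained, and `mean_pool` is a hypothetical stand-in for a summary vector (the listing does not spell out the paper's three initialization methods). It only shows where the decoder input enters the attractor computation.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rnn_step(W_in, W_h, x, h):
    """One step of a plain tanh recurrent cell (assumed stand-in for an LSTM)."""
    a = matvec(W_in, x)
    b = matvec(W_h, h)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]


def random_matrix(rng, dim):
    """Square matrix of small random weights (untrained, illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]


def generate_attractors(frames, n_attractors, decoder_input, weights):
    """Encode the frame embeddings, then decode one attractor per step.

    The decoder starts from the final encoder state and receives the same
    `decoder_input` at every step: a zero vector in the baseline setup,
    a summary vector in the extended setup.
    """
    enc_in, enc_h, dec_in, dec_h = weights
    h = [0.0] * len(frames[0])
    for frame in frames:
        h = rnn_step(enc_in, enc_h, frame, h)
    attractors = []
    for _ in range(n_attractors):
        h = rnn_step(dec_in, dec_h, decoder_input, h)
        attractors.append(h)
    return attractors


def mean_pool(frames):
    """Hypothetical summary vector: the average of all frame embeddings."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


rng = random.Random(0)
dim = 4
frames = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(10)]
weights = tuple(random_matrix(rng, dim) for _ in range(4))

zero_input = [0.0] * dim            # baseline: uninformative decoder input
summary_input = mean_pool(frames)   # extended: utterance-level summary

baseline_attractors = generate_attractors(frames, 3, zero_input, weights)
summary_attractors = generate_attractors(frames, 3, summary_input, weights)
```

With a zero input, each decoding step depends only on the recurrent state inherited from the encoder. Feeding a summary vector gives every step a second source of information about the whole recording. In the paper that representation is learned, which this toy example does not attempt.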
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
