BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers
Eunjung Han, Chul Lee, Andreas Stolcke

TL;DR
This paper introduces BW-EDA-EEND, an online end-to-end neural speaker diarization system that processes streaming audio block by block for a variable number of speakers, with computational cost that grows linearly with audio length.
Contribution
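The paper proposes an incremental Transformer encoder for the EDA-EEND architecture that attends only to left context and carries hidden states from block to block, enabling online diarization for a variable number of speakers with a trade-off between accuracy and latency.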
Findings
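Unlimited-latency BW-EDA-EEND outperforms a baseline offline clustering diarization system for one to four speakers with unlimited context.
With limited latency, accuracy remains comparable to offline systems.
Compared to offline EDA-EEND, degradation is only moderate for up to two speakers with a 10-second context; the gap grows with more than two speakers.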
Abstract
We present a novel online end-to-end neural diarization system, BW-EDA-EEND, that processes data incrementally for a variable number of speakers. The system is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al., but utilizes the incremental Transformer encoder, attending only to its left contexts and using block-level recurrence in the hidden states to carry information from block to block, making the algorithm complexity linear in time. We propose two variants: For unlimited-latency BW-EDA-EEND, which processes inputs in linear time, we show only moderate degradation for up to two speakers using a context size of 10 seconds compared to offline EDA-EEND. With more than two speakers, the accuracy gap between online and offline grows, but the algorithm still outperforms a baseline offline clustering diarization system for one to four speakers with unlimited…
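The block-level recurrence described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the single-head attention, the residual-only layers, and the choice to let frames attend to their own block plus a fixed number of cached left-context blocks are all assumptions made for illustration. Each layer caches its inputs from earlier blocks and reuses them as keys and values, so the work per block is bounded and total cost grows linearly with the number of frames.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(scores):
    """Row-wise softmax with max subtraction for numerical stability."""
    result = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

def rand_matrix(rows, cols):
    """Hypothetical helper: small random weights or features."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def incremental_encoder(frames, layers, block_size, context_blocks):
    """Encode frames block by block (illustrative, hypothetical API).

    Returns the encoded frames and the number of query-key pairs scored.
    """
    caches = [[] for _ in layers]  # per-layer inputs of earlier blocks
    outputs, pairs = [], 0
    for start in range(0, len(frames), block_size):
        hidden = frames[start:start + block_size]
        for i, (w_q, w_k, w_v) in enumerate(layers):
            # Left context only: cached hidden states of previous blocks.
            left = caches[i][-context_blocks:] if context_blocks > 0 else []
            memory = [frame for block in left for frame in block] + hidden
            caches[i].append(hidden)  # carried forward to the next block
            q = matmul(hidden, w_q)
            k = matmul(memory, w_k)
            v = matmul(memory, w_v)
            scale = math.sqrt(len(q[0]))
            scores = [[sum(a * b for a, b in zip(qr, kr)) / scale for kr in k]
                      for qr in q]
            attended = matmul(softmax_rows(scores), v)
            pairs += len(q) * len(k)
            # Residual connection; feed-forward and layer norm are omitted.
            hidden = [[h + a for h, a in zip(hr, ar)]
                      for hr, ar in zip(hidden, attended)]
        outputs.extend(hidden)
    return outputs, pairs

random.seed(0)
dim = 4
layers = [tuple(rand_matrix(dim, dim) for _ in range(3)) for _ in range(2)]
frames = rand_matrix(12, dim)
encoded, pairs = incremental_encoder(frames, layers, block_size=4, context_blocks=1)
```

In this sketch, doubling the number of frames roughly doubles the count of scored query-key pairs, whereas full self-attention over the whole recording would quadruple it. Because no block sees later blocks, outputs for earlier frames do not change when new audio arrives.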
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Dropout · Layer Normalization · Attention Is All You Need · Multi-Head Attention · Byte Pair Encoding
