BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers

Eunjung Han; Chul Lee; Andreas Stolcke

arXiv:2011.02678 · cs.SD · February 22, 2022

TL;DR

This paper introduces BW-EDA-EEND, an online neural speaker diarization system capable of processing streaming audio with a variable number of speakers, maintaining high accuracy with low latency and linear computational complexity.

Contribution

It proposes a novel incremental Transformer-based architecture for online speaker diarization that handles a variable number of speakers and balances accuracy against latency.

Findings

01. Outperforms a baseline offline clustering diarization system for one to four speakers with unlimited context.

02. The limited-latency variant maintains accuracy comparable to offline systems.

03. Compared to offline EDA-EEND, degradation is moderate for up to two speakers with a 10-second context.

Abstract

We present a novel online end-to-end neural diarization system, BW-EDA-EEND, that processes data incrementally for a variable number of speakers. The system is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al., but utilizes the incremental Transformer encoder, attending only to its left contexts and using block-level recurrence in the hidden states to carry information from block to block, making the algorithm complexity linear in time. We propose two variants: For unlimited-latency BW-EDA-EEND, which processes inputs in linear time, we show only moderate degradation for up to two speakers using a context size of 10 seconds compared to offline EDA-EEND. With more than two speakers, the accuracy gap between online and offline grows, but the algorithm still outperforms a baseline offline clustering diarization system for one to four speakers with unlimited…
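
The block-wise, left-context-only attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `blockwise_left_context_attention`, the parameters `block_size` and `context_size`, the single attention head without learned projections, and the choice to let frames inside a block see each other are all assumptions made for illustration. The EDA attractor stage is omitted entirely.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def blockwise_left_context_attention(frames, block_size, context_size):
    """Hypothetical single-head self-attention processed block by block.

    Each block attends to itself plus at most `context_size` cached
    hidden states from earlier blocks. It never sees future blocks.
    Returns the outputs and the number of query-key scores computed.
    """
    num_frames, dim = frames.shape
    cache = np.zeros((0, dim))  # hidden states carried from block to block
    outputs = []
    n_scores = 0
    for start in range(0, num_frames, block_size):
        block = frames[start:start + block_size]
        # Keys and values: cached left context followed by the current block.
        keys = np.concatenate([cache, block], axis=0)
        scores = block @ keys.T / np.sqrt(dim)
        n_scores += scores.size
        outputs.append(softmax(scores) @ keys)
        # Keep only the most recent `context_size` states for the next block.
        cache = keys[max(0, len(keys) - context_size):]
    return np.concatenate(outputs, axis=0), n_scores


rng = np.random.default_rng(0)
x = rng.standard_normal((12, 8))
y, n_scores = blockwise_left_context_attention(x, block_size=4, context_size=4)
print(y.shape, n_scores)
```

With 12 frames, a block size of 4, and a context of 4, this sketch computes 16 + 32 + 32 = 80 attention scores, against 144 for full self-attention over the same frames. Because each block sees at most `block_size + context_size` keys, the score count is bounded by `num_frames * (block_size + context_size)`, which grows linearly with the input length.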

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Dropout · Layer Normalization · Attention Is All You Need · Multi-Head Attention · Byte Pair Encoding