EEND-DEMUX: End-to-End Neural Speaker Diarization via Demultiplexed Speaker Embeddings

Sung Hwan Mun, Min Hyun Han, Canyeong Moon, and Nam Soo Kim

arXiv:2312.06065 · eess.AS · December 12, 2023 · 1 citation

Open Access

TL;DR

This paper introduces EEND-DEMUX, an end-to-end neural speaker diarization model that disentangles speaker information in the latent space. At inference it obtains separated speaker embeddings directly, without an external speaker diarization system, an embedding extractor, or a heuristic decoding technique.

Contribution

EEND-DEMUX obtains separated speaker embeddings through a demultiplexing operation and uses multi-head cross-attention to capture the correlation between the mixture embedding and the separated speaker embeddings.

Findings

01. Improved diarization performance on the LibriMix dataset.

02. Effective disentanglement of speaker embeddings in the latent space.

03. No external diarization system needed during inference.

Abstract

In recent years, there have been studies to further improve the end-to-end neural speaker diarization (EEND) systems. This letter proposes the EEND-DEMUX model, a novel framework utilizing demultiplexed speaker embeddings. In this work, we focus on disentangling speaker-relevant information in the latent space and then transform each separated latent variable into its corresponding speech activity. EEND-DEMUX can directly obtain separated speaker embeddings through the demultiplexing operation in the inference phase without an external speaker diarization system, an embedding extractor, or a heuristic decoding technique. Furthermore, we employ a multi-head cross-attention mechanism to capture the correlation between mixture and separated speaker embeddings effectively. We formulate three loss functions based on matching, orthogonality, and sparsity constraints to learn robust…
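The abstract names orthogonality and sparsity constraints on the demultiplexed speaker embeddings but, as quoted here, does not give their formulas. The following is only an illustrative sketch under stated assumptions, not the authors' loss definitions: demultiplexing is modeled as one hypothetical linear projection per speaker, the orthogonality term as the mean squared cosine similarity between distinct speaker embeddings, and the sparsity term as a mean absolute value. The function names, shapes, and projections are all invented for illustration, and the matching loss is omitted because the quoted text does not say what is matched.

```python
import numpy as np


def demultiplex(mixture, projections):
    """Map one mixture embedding (D,) to S speaker embeddings (S, D).

    ASSUMPTION: one linear projection per speaker, shape (S, D, D).
    The paper's actual demultiplexing operation may differ.
    """
    return np.einsum("sij,j->si", projections, mixture)


def orthogonality_penalty(embeddings, eps=1e-8):
    """Mean squared cosine similarity between distinct rows (needs S >= 2).

    Returns 0 when all speaker embeddings are mutually orthogonal and 1 when
    they are all parallel. ASSUMPTION: illustrative form only.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, eps)
    gram = unit @ unit.T
    off_diag = gram - np.diag(np.diag(gram))  # zero out self-similarity
    s = embeddings.shape[0]
    return float((off_diag ** 2).sum() / (s * (s - 1)))


def sparsity_penalty(embeddings):
    """Mean absolute value of the entries (an L1-style penalty).

    ASSUMPTION: illustrative form only.
    """
    return float(np.abs(embeddings).mean())


# Toy usage with random data: 2 speakers, 8-dimensional embeddings.
rng = np.random.default_rng(0)
D, S = 8, 2
mixture = rng.standard_normal(D)
projections = rng.standard_normal((S, D, D))
speakers = demultiplex(mixture, projections)
print(speakers.shape)
print(orthogonality_penalty(speakers), sparsity_penalty(speakers))
```

In a sketch like this, driving the orthogonality term toward zero pushes the per-speaker embeddings apart in direction, while the sparsity term discourages each embedding from spreading energy across every dimension. How the paper weights or combines its three losses is not stated in the quoted text.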

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing