Sequence-to-Sequence Neural Diarization with Automatic Speaker Detection and Representation

Ming Cheng; Yuke Lin; Ming Li

arXiv:2411.13849 · eess.AS · June 24, 2025

Open Access

TL;DR

This paper introduces Sequence-to-Sequence Neural Diarization (S2SND), a framework for online and offline speaker diarization. It detects unknown speakers without a prior enrollment step, learns speaker embeddings jointly within the diarization network, and is reported to achieve high accuracy in experiments.

Contribution

It presents a new diarization paradigm that jointly learns speaker embeddings within the network and handles unknown speakers without prior enrollment.

Findings

01. Achieves high diarization accuracy in experiments.

02. Handles unknown speakers without prior enrollment.

03. Operates effectively in online and offline modes.

Abstract

This paper proposes a novel Sequence-to-Sequence Neural Diarization (S2SND) framework to perform online and offline speaker diarization. It is developed from the sequence-to-sequence architecture of our previous target-speaker voice activity detection system and then evolves into a new diarization paradigm by addressing two critical problems. 1) Speaker Detection: The proposed approach can utilize partially given speaker embeddings to discover the unknown speaker and predict the target voice activities in the audio signal. It does not require a prior diarization system for speaker enrollment in advance. 2) Speaker Representation: The proposed approach can adopt the predicted voice activities as reference information to extract speaker embeddings from the audio signal simultaneously. The representation space of speaker embedding is jointly learned within the whole diarization network…
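
As a rough illustration of the Speaker Representation idea in the abstract, the sketch below pools frame-level features weighted by predicted voice activity to obtain one embedding per speaker. It is not the S2SND network: the function name `activity_weighted_embedding`, the plain weighted mean, and the toy feature values are all assumptions made for illustration, whereas the paper learns embedding extraction jointly inside the diarization network.

```python
from typing import List


def activity_weighted_embedding(
    frames: List[List[float]],  # T frames, each a D-dimensional feature vector
    activity: List[float],      # T voice-activity probabilities for one speaker
    eps: float = 1e-8,          # guards against division by zero for silent speakers
) -> List[float]:
    """Hypothetical helper: mean of frame features weighted by voice activity."""
    if not frames:
        raise ValueError("frames must not be empty")
    if len(frames) != len(activity):
        raise ValueError("frames and activity must have the same length")

    dim = len(frames[0])
    pooled = [0.0] * dim
    for feat, weight in zip(frames, activity):
        for d in range(dim):
            pooled[d] += weight * feat[d]

    total = sum(activity)
    return [value / (total + eps) for value in pooled]


# Toy input (assumed): two speakers occupying disjoint halves of four frames.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
speaker_a = activity_weighted_embedding(frames, [1.0, 1.0, 0.0, 0.0])
speaker_b = activity_weighted_embedding(frames, [0.0, 0.0, 1.0, 1.0])
```

With these toy inputs each speaker's embedding is dominated by the frames in which that speaker is active, which is the sense in which voice activities can act as reference information for extracting embeddings.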

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: ADaptive gradient method with the OPTimal convergence rate