Dual-Strategy-Enhanced ConBiMamba for Neural Speaker Diarization

Zhen Liao; Gaole Dai; Mengqiao Chen; Wenqing Cheng; Wei Xu

arXiv:2601.19472 · cs.SD · January 28, 2026

Open Access

TL;DR

This paper introduces a dual-strategy-enhanced neural speaker diarization system built on ConBiMamba, which combines Conformer and Mamba components. It adds a boundary-enhanced loss and layer-wise feature aggregation, and reports state-of-the-art results on four datasets.

Contribution

It applies ConBiMamba to speaker diarization and strengthens it with two strategies, a boundary-enhanced loss and layer-wise feature aggregation, to improve both local detail modeling and long-range dependency handling.
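The page names the two strategies but does not define them. As a rough illustration only, the sketch below shows one plausible reading: a binary cross-entropy that up-weights frames near speaker change points, and a softmax-weighted sum over per-layer features. The function names, the `radius` and `boost` parameters, and the weighting scheme itself are assumptions made for this example, not the paper's formulation.

```python
import math

# Illustrative sketch only: names, parameters, and the weighting scheme are
# assumptions, not the definitions used in the paper.

def change_points(labels):
    """Indices t where the frame label differs from the label at t - 1."""
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]

def boundary_weights(labels, radius=2, boost=2.0):
    """Weight 1.0 everywhere; `boost` for `radius` frames on each side of a change."""
    weights = [1.0] * len(labels)
    for cp in change_points(labels):
        for t in range(max(0, cp - radius), min(len(labels), cp + radius)):
            weights[t] = boost
    return weights

def weighted_bce(probs, labels, weights, eps=1e-7):
    """Weighted mean binary cross-entropy over frames for a single speaker."""
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)

def aggregate_layers(layer_outputs, logits):
    """Softmax-weighted sum of per-layer feature vectors of equal length."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(layer_outputs[0])
    return [sum(a * layer[i] for a, layer in zip(alphas, layer_outputs))
            for i in range(dim)]

labels = [0, 0, 0, 1, 1, 1, 0, 0]
weights = boundary_weights(labels, radius=1)
probs = [0.1, 0.1, 0.4, 0.6, 0.9, 0.6, 0.4, 0.1]
loss = weighted_bce(probs, labels, weights)
```

With this weighting, uncertain predictions right at a speaker change contribute more to the loss than equally uncertain predictions in the middle of a segment. In a trained model the aggregation `logits` would be learnable parameters; here they are plain numbers.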

Findings

01. Achieves state-of-the-art performance on four datasets

02. Effectively handles long audio sequences with ExtBiMamba

03. Improves speaker change point detection accuracy

Abstract

Conformer and Mamba have achieved strong performance in speech modeling but face limitations in speaker diarization. Mamba is efficient but struggles with local details and nonlinear patterns. Conformer's self-attention incurs high memory overhead for long speech sequences and may cause instability in long-range dependency modeling. These limitations are critical for diarization, which requires both precise modeling of local variations and robust speaker consistency over extended spans. To address these challenges, we first apply ConBiMamba for speaker diarization. We follow the Pyannote pipeline and propose the Dual-Strategy-Enhanced ConBiMamba neural speaker diarization system. ConBiMamba integrates the strengths of Conformer and Mamba, where Conformer's convolutional and feed-forward structures are utilized to improve local feature extraction. By replacing Conformer's self-attention…
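The abstract's contrast between self-attention and Mamba comes down to how each mixes information across time. The sketch below is a deliberately simplified stand-in: a fixed-coefficient linear recurrence run in both directions over a scalar sequence. A real Mamba layer uses input-dependent (selective) state-space parameters and vector-valued states, and the module the paper actually substitutes for self-attention is not specified on this page. The names `linear_scan` and `bidirectional_scan`, and the `decay` and `gain` values, are hypothetical.

```python
# Simplified stand-in for a bidirectional state-space mixer. This is NOT
# Mamba and NOT the paper's module: coefficients are fixed, not selective.

def linear_scan(xs, decay=0.9, gain=0.1):
    """Causal recurrence h_t = decay * h_{t-1} + gain * x_t.

    Runs in O(T) time with O(1) state, whereas full self-attention
    forms a T x T score matrix.
    """
    h, out = 0.0, []
    for x in xs:
        h = decay * h + gain * x
        out.append(h)
    return out

def bidirectional_scan(xs, decay=0.9, gain=0.1):
    """Sum of a left-to-right scan and a right-to-left scan."""
    forward = linear_scan(xs, decay, gain)
    backward = linear_scan(xs[::-1], decay, gain)[::-1]
    return [f + b for f, b in zip(forward, backward)]

# An impulse in the middle spreads symmetrically to both sides.
mixed = bidirectional_scan([0.0, 1.0, 0.0], decay=0.5, gain=1.0)
```

Because each output frame sees both past and future context through the two scans, a block built this way can keep the non-causal view that offline diarization relies on while avoiding the quadratic memory cost of attention on long recordings.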

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis