Mamba-based Segmentation Model for Speaker Diarization
Alexis Plaquet, Naohiro Tawara, Marc Delcroix, Shota Horiguchi, Atsushi Ando, Shoko Araki

TL;DR
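This paper evaluates Mamba, a recently proposed architecture that behaves like an RNN with attention-like capabilities, as the segmentation model in the pyannote speaker diarization pipeline. Mamba makes longer local windows usable, which improves diarization quality, and the resulting system reaches state-of-the-art performance.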
Contribution
The paper assesses Mamba, a recently proposed RNN-like architecture with attention-like capabilities, for speaker diarization by comparing the state-of-the-art neural segmentation of the pyannote pipeline with a proposed Mamba-based variant.
Findings
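Mamba's stronger processing capabilities allow longer local windows, which make speaker embedding extraction more reliable and significantly improve diarization quality.
The Mamba-based system achieves state-of-the-art performance on three widely used diarization datasets.
Mamba is a superior alternative to both the traditional RNN and the tested attention-based model.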
Abstract
Mamba is a newly proposed architecture which behaves like a recurrent neural network (RNN) with attention-like capabilities. These properties are promising for speaker diarization, as attention-based models have unsuitable memory requirements for long-form audio, and traditional RNN capabilities are too limited. In this paper, we propose to assess the potential of Mamba for diarization by comparing the state-of-the-art neural segmentation of the pyannote pipeline with our proposed Mamba-based variant. Mamba's stronger processing capabilities allow usage of longer local windows, which significantly improve diarization quality by making the speaker embedding extraction more reliable. We find Mamba to be a superior alternative to both traditional RNN and the tested attention-based model. Our proposed Mamba-based system achieves state-of-the-art performance on three widely used diarization…
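The abstract refers to processing long-form audio in local windows. As a rough illustration only, the sketch below splits a recording into overlapping local windows and shows how a longer window reduces the number of chunks to process. The function name `local_windows` and all durations are hypothetical, chosen for the example; they are not taken from the paper or from the pyannote API.

```python
def local_windows(total_duration, window, step):
    """Return (start, end) times in seconds of sliding local windows.

    Hypothetical helper for illustration: windows of length `window`
    advance by `step` seconds; the last window is clipped to the end
    of the recording, so it may be shorter than `window`.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    windows = []
    start = 0.0
    while True:
        end = min(start + window, total_duration)
        windows.append((start, end))
        if end >= total_duration:
            break
        start += step
    return windows


# Illustrative values only (not the settings used in the paper).
short = local_windows(60.0, window=10.0, step=5.0)
long_ = local_windows(60.0, window=50.0, step=25.0)
print(len(short), len(long_))
```

Each window would be passed to the segmentation model independently, so a longer window gives the model more context per chunk at the cost of processing a longer sequence in one pass.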
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
