On the Role of Spatial Features in Foundation-Model-Based Speaker Diarization

Marc Deegen; Tobias Gburrek; Tobias Cord-Landwehr; Thilo von Neumann; Jiangyu Han; Lukáš Burget; Reinhold Haeb-Umbach

arXiv:2601.02231·eess.AS·January 6, 2026

Open Access

TL;DR

This paper investigates how incorporating spatial features affects the performance of foundation-model-based speaker diarization, revealing that existing models already capture much spatial information, limiting the gains from explicit spatial cues.

Contribution

It analyzes the impact of adding spatial information to a state-of-the-art single-channel diarization system using foundation models, highlighting the limitations of spatial cues in current approaches.

Findings

01

Spatial information can improve diarization performance.

02

Features aggregated over WavLM layers already encode much spatial information.

03

The overall improvement from spatial cues is smaller than expected.

Abstract

Recent advances in speaker diarization exploit large pretrained foundation models, such as WavLM, to achieve state-of-the-art performance on multiple datasets. Systems like DiariZen leverage these rich single-channel representations, but are limited to single-channel audio, preventing the use of spatial cues available in multi-channel recordings. This work analyzes the impact of incorporating spatial information into a state-of-the-art single-channel diarization system by evaluating several strategies for conditioning the model on multi-channel spatial features. Experiments on meeting-style datasets indicate that spatial information can improve diarization performance, but the overall improvement is smaller than expected for the proposed system, suggesting that the features aggregated over all WavLM layers already capture much of the information needed for accurate speaker…
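The abstract refers to features aggregated over all WavLM layers but does not say how the aggregation is done. A minimal sketch, assuming a softmax-weighted sum over layers (a common choice for this kind of model, not confirmed here as the authors' method), could look like the following. The names `aggregate_layers` and `layer_logits` are hypothetical, and the toy inputs stand in for real WavLM hidden states.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_logits):
    """Combine per-layer features into one feature sequence.

    layer_features: list of L layers; each layer is a list of T frames;
                    each frame is a list of D floats.
    layer_logits:   list of L unnormalized layer weights (assumed learnable
                    in a real system; fixed numbers in this sketch).
    Returns a list of T frames, each a list of D floats.
    """
    if len(layer_features) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")

    weights = softmax(layer_logits)
    num_frames = len(layer_features[0])
    dim = len(layer_features[0][0])

    # Accumulate the weighted contribution of every layer, frame by frame.
    combined = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_features):
        for t in range(num_frames):
            for d in range(dim):
                combined[t][d] += weight * layer[t][d]
    return combined


# Toy usage: two layers, one frame, two feature dimensions.
toy_layers = [
    [[1.0, 2.0]],
    [[3.0, 4.0]],
]
toy_result = aggregate_layers(toy_layers, [0.0, 0.0])
```

With equal logits the weights are uniform, so the result is the plain average of the layers. Under this assumption, any spatial information present in individual layers is carried into the aggregated representation in proportion to the layer weights.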

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing