Moving Speaker Separation via Parallel Spectral-Spatial Processing

Yuzhu Wang; Archontis Politis; Konstantinos Drossos; Tuomas Virtanen

arXiv:2602.22487 · eess.AS · February 27, 2026

Open Access

TL;DR

This paper introduces a dual-branch parallel spectral-spatial (PS2) architecture for moving speaker separation. It models spectral and spatial features in separate streams and fuses them via cross-attention, improving on state-of-the-art methods by 1.6-2.2 dB SI-SDR.

Contribution

A parallel spectral-spatial (PS2) architecture that processes spectral and spatial features in separate streams and combines them with adaptive fusion, outperforming existing methods when speakers move.

Findings

01 Outperforms state-of-the-art methods by 1.6-2.2 dB in SI-SDR

02 Maintains over 13 dB SI-SDR improvement under fast source movement

03 Remains robust across a range of reverberation and noise conditions
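
The findings above are stated in SI-SDR (scale-invariant signal-to-distortion ratio) and in SI-SDR improvement. As a reading aid, here is a minimal sketch of the commonly used SI-SDR definition in plain Python. The function names `si_sdr` and `si_sdr_improvement`, the mean removal, and the small `eps` stabiliser are assumptions made for illustration; the paper's own evaluation code is not shown on this page and may differ in detail.

```python
import math

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB (illustrative sketch, not the authors' code)."""
    n = len(reference)
    if n == 0 or len(estimate) != n:
        raise ValueError("signals must be non-empty and of equal length")
    # Remove the mean of each signal (a common convention, assumed here).
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    est = [x - mean_e for x in estimate]
    ref = [x - mean_r for x in reference]
    # Project the estimate onto the reference to get the scaled target.
    ref_energy = sum(r * r for r in ref) + eps
    alpha = sum(e * r for e, r in zip(est, ref)) / ref_energy
    target = [alpha * r for r in ref]
    # Everything not explained by the scaled reference counts as distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))

def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDR of the separated estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

# Toy signals: the distortion is orthogonal to the reference by construction.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 0.9, -1.1]   # reference + small distortion
mixture = [2.0, 0.0, 0.0, -2.0]     # reference + 10x larger distortion
print(si_sdr(estimate, reference))
print(si_sdr_improvement(estimate, mixture, reference))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, which is the property that makes the metric scale-invariant.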

Abstract

Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both feature types, creating an inherent modeling conflict. In this paper, we propose a dual-branch parallel spectral-spatial (PS2) architecture that separately processes spectral and spatial features through parallel streams. The spectral branch uses a bi-directional long short-term memory (BLSTM)-based frequency module, a Mamba-based temporal module, and a self-attention module to model spectral features. The spatial branch employs bi-directional gated recurrent unit (BGRU) networks to process spatial features that encode the evolving geometric relationships between sources and microphones. Features from both branches…
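
The abstract is cut off before it describes the fusion step, but the TL;DR states that the two branches are integrated via cross-attention. The sketch below illustrates generic single-head scaled dot-product cross-attention in plain Python, with one feature stream acting as queries and the other as keys and values. Which branch plays which role, the absence of learned projections, and the names `softmax` and `cross_attention` are all assumptions for illustration; this is not the PS2 fusion module itself.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention without learned weights.

    queries: list of T_q vectors of dimension d (e.g. spectral frames, assumed)
    keys, values: lists of T_k vectors (e.g. spatial frames, assumed)
    Returns T_q vectors, each a weighted average of the value vectors.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        # Similarity of this query frame to every key frame.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value frames.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused

# Hypothetical toy features: 3 frames per branch, 2 dimensions each.
spectral = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
spatial = [[0.5, 0.5], [1.0, -1.0], [-1.0, 1.0]]
fused = cross_attention(spectral, spatial, spatial)
print(fused)
```

Each output frame is a convex combination of the value frames, so one stream decides where to look while the other supplies the content. A practical model would add learned query, key, and value projections and multiple heads.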

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Blind Source Separation Techniques