Steering Deep Non-Linear Spatially Selective Filters for Weakly Guided Extraction of Moving Speakers in Dynamic Scenarios

Jakob Kienegger; Timo Gerkmann

arXiv:2505.14517·eess.AS·May 21, 2025

TL;DR

This paper introduces a weakly guided deep filtering approach for extracting moving speakers in dynamic environments, overcoming challenges of spatial ambiguity without relying on continuous directional cues.

Contribution

It proposes a deep tracking algorithm and a joint training strategy that together enable speaker extraction from the target's initial position alone, improving performance in spatially dynamic scenarios.

Findings

01 Outperforms strongly guided methods in dynamic scenarios

02 Resolves spatial ambiguities effectively

03 Demonstrates robustness with synthetic training data

Abstract

Recent speaker extraction methods using deep non-linear spatial filtering perform exceptionally well when the target direction is known and stationary. However, spatially dynamic scenarios are considerably more challenging due to time-varying spatial features and arising ambiguities, e.g., when moving speakers cross. While in a static scenario it may be easy for a user to point to the target's direction, manually tracking a moving speaker is impractical. Instead of relying on accurate time-dependent directional cues, which we refer to as strong guidance, in this paper we propose a weakly guided extraction method that depends solely on the target's initial position to cope with spatially dynamic scenarios. By incorporating our own deep tracking algorithm and developing a joint training strategy on a synthetic dataset, we demonstrate the proficiency of our approach in resolving spatial…
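
To make the distinction between strong and weak guidance concrete, the following is a minimal toy sketch and not the paper's method. It assumes that each frame already comes with a list of candidate directions (one per active speaker), which is a simplification introduced here for illustration only. The tracker is a plain nearest-neighbour association that stands in for the deep tracking algorithm, and the names `wrap_angle`, `strong_guidance`, and `weak_guidance` are hypothetical.

```python
import math


def wrap_angle(theta):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (theta + math.pi) % (2 * math.pi) - math.pi


def strong_guidance(target_doas):
    """Strong guidance: the target direction is supplied for every frame."""
    return list(target_doas)


def weak_guidance(initial_doa, candidates_per_frame):
    """Weak guidance: only the initial target direction is supplied.

    Assumption (not from the paper): every frame offers a list of candidate
    directions. The estimate follows whichever candidate is closest in angle
    to the previous estimate, a simple stand-in for a learned tracker.
    """
    estimate = initial_doa
    track = []
    for candidates in candidates_per_frame:
        estimate = min(candidates, key=lambda c: abs(wrap_angle(c - estimate)))
        track.append(estimate)
    return track


# Toy scene: the target moves from 0.0 to 0.4 rad, an interferer stays at 1.0 rad.
target_path = [0.0, 0.1, 0.2, 0.3, 0.4]
candidates_per_frame = [[doa, 1.0] for doa in target_path]

strong = strong_guidance(target_path)
weak = weak_guidance(initial_doa=0.0, candidates_per_frame=candidates_per_frame)
```

In this toy scene the weakly guided track coincides with the strongly guided one because the two speakers stay well separated. A nearest-neighbour rule like this can lock onto the wrong speaker when trajectories cross, which is the kind of spatial ambiguity the abstract describes.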

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing