Autoregressive Guidance of Deep Spatially Selective Filters using Bayesian Tracking for Efficient Extraction of Moving Speakers

Jakob Kienegger; Timo Gerkmann

arXiv:2603.23723 · eess.AS · March 26, 2026

Open Access

TL;DR

This paper introduces a Bayesian tracking method that autoregressively guides deep spatial filters for moving speaker enhancement, improving accuracy and robustness in dynamic scenarios with minimal additional computational cost.

Contribution

The paper proposes Bayesian tracking algorithms that are compatible with arbitrary deep spatial filters, and it publishes a new dataset of simulated speaker trajectories based on the social force model for development and evaluation.

Findings

01 Autoregressive guidance improves tracking accuracy.

02 Speech quality improves with negligible computational overhead.

03 The method generalizes well to challenging real-world acoustic environments.

Abstract

Deep spatially selective filters achieve high-quality enhancement with real-time capable architectures for stationary speakers of known directions. To retain this level of performance in dynamic scenarios when only the speakers' initial directions are given, accurate, yet computationally lightweight tracking algorithms become necessary. Assuming a frame-wise causal processing style, temporal feedback allows for leveraging the enhanced speech signal to improve tracking performance. In this work, we investigate strategies to incorporate the enhanced signal into lightweight tracking algorithms and autoregressively guide deep spatial filters. Our proposed Bayesian tracking algorithms are compatible with arbitrary deep spatial filters. To increase the realism of simulated trajectories during development and evaluation, we propose and publish a novel dataset based on the social force model.…
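
The feedback loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the tracker, so a constant-velocity Kalman filter over azimuth is assumed here as one simple Bayesian tracker. The names `AzimuthKalmanTracker`, `spatial_filter_stub`, and `doa_from_enhanced_stub` are hypothetical, and both stubs are placeholders for a real deep spatial filter and a real direction estimator.

```python
import numpy as np


def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + np.pi) % (2 * np.pi) - np.pi


class AzimuthKalmanTracker:
    """Constant-velocity Kalman filter over azimuth (an assumed tracker)."""

    def __init__(self, init_azimuth, dt=0.016, process_var=1.0, obs_var=0.05):
        self.x = np.array([init_azimuth, 0.0])  # [azimuth, angular velocity]
        self.P = np.diag([0.01, 1.0])           # initial state covariance
        self.F = np.array([[1.0, dt], [0.0, 1.0]])
        # Process noise for a white-acceleration motion model.
        self.Q = process_var * np.array([[dt**4 / 4, dt**3 / 2],
                                         [dt**3 / 2, dt**2]])
        self.H = np.array([[1.0, 0.0]])         # only azimuth is observed
        self.R = obs_var

    def predict(self):
        self.x = self.F @ self.x
        self.x[0] = wrap_angle(self.x[0])
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[0]

    def update(self, observed_azimuth):
        # Wrap the innovation so that angles near +/- pi are handled.
        innovation = wrap_angle(observed_azimuth - self.x[0])
        S = (self.H @ self.P @ self.H.T).item() + self.R
        K = (self.P @ self.H.T).ravel() / S
        self.x = self.x + K * innovation
        self.x[0] = wrap_angle(self.x[0])
        self.P = (np.eye(2) - np.outer(K, self.H.ravel())) @ self.P
        return self.x[0]


def spatial_filter_stub(frame, steering_azimuth):
    """Hypothetical stand-in for a deep spatial filter: passes the frame through."""
    return frame


def doa_from_enhanced_stub(enhanced, true_azimuth, rng, noise_std=0.2):
    """Hypothetical stand-in for a direction estimate derived from the
    enhanced signal; here it is simulated as ground truth plus noise."""
    return wrap_angle(true_azimuth + noise_std * rng.standard_normal())


def run_autoregressive_tracking(true_azimuths, init_azimuth, seed=0):
    """Frame-wise causal loop: predict, steer the filter, observe, update."""
    rng = np.random.default_rng(seed)
    tracker = AzimuthKalmanTracker(init_azimuth)
    estimates = []
    for true_az in true_azimuths:
        steer = tracker.predict()                # direction for this frame
        frame = np.zeros(4)                      # dummy multichannel frame
        enhanced = spatial_filter_stub(frame, steer)
        obs = doa_from_enhanced_stub(enhanced, true_az, rng)
        estimates.append(tracker.update(obs))    # feedback for the next frame
    return np.array(estimates)
```

In this sketch only the initial direction is supplied, matching the setting in the abstract; every later steering direction comes from the tracker's own prediction, which is in turn corrected by an observation tied to the previous enhanced output.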

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis