Adaptive Rotary Steering with Joint Autoregression for Robust Extraction of Closely Moving Speakers in Dynamic Scenarios
Jakob Kienegger, Timo Gerkmann

TL;DR
This paper introduces a joint autoregressive framework that makes adaptive rotary steering more robust in dynamic multi-speaker scenarios. It feeds the processed recording back as guidance, so that temporal-spectral correlations of speech help track and separate closely spaced moving speakers.
Contribution
It proposes a novel joint autoregressive approach that feeds the processed recording back into both the tracking and the enhancement algorithm as additional guidance, improving tracking and separation of moving speakers when spatial cues alone are unreliable.
Findings
Significant improvement in tracking accuracy for closely spaced speakers
Outperforms non-autoregressive methods on synthetic datasets
Effective in real-world scenarios with multiple crossings
Abstract
Latest advances in deep spatial filtering for Ambisonics demonstrate strong performance in stationary multi-speaker scenarios by rotating the sound field toward a target speaker prior to multi-channel enhancement. For applicability in dynamic acoustic conditions with moving speakers, we propose to automate this rotary steering using an interleaved tracking algorithm conditioned on the target's initial direction. However, for nearby or crossing speakers, robust tracking becomes difficult and spatial cues less effective for enhancement. By incorporating the processed recording as an additional guide into both algorithms, our novel joint autoregressive framework leverages temporal-spectral correlations of speech to resolve spatially challenging speaker constellations. Consequently, our proposed method significantly improves tracking and enhancement of closely spaced speakers, consistently…
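The rotary steering and feedback loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes first-order Ambisonics in ACN channel order (W, Y, Z, X), steers in azimuth only, and uses hypothetical names (rotate_foa_yaw, steer_and_enhance) together with hypothetical tracker and enhancer callables that stand in for the learned models.

```python
import numpy as np

def rotate_foa_yaw(foa, azimuth):
    """Rotate a first-order Ambisonics signal about the vertical axis so
    that a source located at `azimuth` (radians) ends up at azimuth 0.

    foa:     array of shape (4, T), assumed ACN order (W, Y, Z, X).
    azimuth: scalar, or array of shape (T,) for a moving target.
    """
    w, y, z, x = foa
    c, s = np.cos(azimuth), np.sin(azimuth)
    x_rot = c * x + s * y   # component pointing at the target after rotation
    y_rot = -s * x + c * y  # lateral component, vanishes for the target itself
    return np.stack([w, y_rot, z, x_rot])

def steer_and_enhance(frames, init_azimuth, tracker, enhancer):
    """Interleave tracking and enhancement frame by frame.

    `tracker` and `enhancer` are hypothetical callables. Both receive the
    previously processed output as extra guidance (the autoregressive part).
    """
    azimuth = init_azimuth
    prev_out = np.zeros(frames[0].shape[1])  # no processed audio yet
    outputs, azimuths = [], []
    for frame in frames:
        azimuth = tracker(frame, azimuth, prev_out)  # update target direction
        rotated = rotate_foa_yaw(frame, azimuth)     # steer toward the target
        prev_out = enhancer(rotated, prev_out)       # enhance the steered frame
        outputs.append(prev_out)
        azimuths.append(azimuth)
    return np.concatenate(outputs), np.array(azimuths)
```

With a trivial tracker that keeps the initial direction and an enhancer that averages the W and rotated X channels, the loop recovers a single plane-wave source exactly. The learned models in the paper replace both stand-ins.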
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis
