Distributed Asynchronous Device Speech Enhancement via Windowed Cross-Attention
Gene-Ping Yang, Sebastian Braun

TL;DR
This paper introduces a windowed cross-attention module for neural multi-microphone processing that effectively handles asynchronous microphones with varying latency and clock drift, improving speech enhancement in dynamic environments.
Contribution
It presents a novel windowed cross-attention mechanism that aligns features across asynchronous microphones, enhancing existing models for real-world scenarios.
Findings
Outperforms TAC in noisy reverberant environments
Faster convergence and better learning in experiments
Effective in multi-talker and asynchronous setups
Abstract
The increasing number of microphone-equipped personal devices offers great flexibility and potential for use as ad-hoc microphone arrays in dynamic meeting environments. However, most existing approaches are designed for time-synchronized microphone setups, a condition that may not hold in real-world meeting scenarios, where latency and clock drift vary across devices. Under such conditions, we found transform-average-concatenate (TAC), a popular module for neural multi-microphone processing, insufficient for handling time-asynchronous microphones. In response, we propose a windowed cross-attention module capable of dynamically aligning features between all microphones. This module is invariant to both the permutation and the number of microphones and can be easily integrated into existing models. Furthermore, we propose an optimal training target for multi-talker environments.…
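The alignment idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, and the function name `windowed_cross_attention`, the `half_window` parameter, and the mean over source microphones are hypothetical choices made for illustration. Each frame of one microphone attends only to a short window of frames from every microphone, so a feature that arrives a few frames early or late on another device can still be matched.

```python
import numpy as np

def windowed_cross_attention(feats, half_window):
    """Illustrative sketch only (assumed design, not the paper's module).

    feats: (M, T, D) array, M microphones, T frames, D feature dims.
    Returns out (M, T, D) and attn (M, M, T, 2 * half_window + 1).
    """
    M, T, D = feats.shape
    W = 2 * half_window + 1
    # Zero-pad the time axis so every frame has a full window of neighbours.
    padded = np.pad(feats, ((0, 0), (half_window, half_window), (0, 0)))
    # windows[n, t, w] holds frame (t + w - half_window) of microphone n.
    idx = np.arange(T)[:, None] + np.arange(W)[None, :]
    windows = padded[:, idx, :]                          # (M, T, W, D)
    # Mark window slots that fall inside the real signal (not padding).
    valid = (idx >= half_window) & (idx < T + half_window)   # (T, W)
    # scores[m, n, t, w]: query frame t of mic m vs. windowed frames of mic n.
    scores = np.einsum("mtd,ntwd->mntw", feats, windows) / np.sqrt(D)
    scores = np.where(valid[None, None], scores, -np.inf)
    # Numerically stable softmax over the window axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    # Time-aligned features from each source microphone n for each query mic m.
    aligned = np.einsum("mntw,ntwd->mntd", attn, windows)
    # Averaging over source microphones keeps the output independent of
    # their order and of how many there are.
    out = aligned.mean(axis=1)
    return out, attn

# Toy usage: microphone 1 observes the same features 3 frames later.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 256))
feats = np.stack([x, np.roll(x, 3, axis=0)])
out, attn = windowed_cross_attention(feats, half_window=5)
# Offset (in frames) at which mic 0 attends most strongly to mic 1.
print(np.argmax(attn[0, 1, 20]) - 5)
```

In this sketch the averaging step plays the role that averaging plays in TAC, but it is applied after the per-frame alignment rather than to frames taken at identical time indices.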
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Indoor and Outdoor Localization Technologies
