All Neural Low-latency Directional Speech Extraction
Ashutosh Pandey, Sanha Lee, Juan Azcarreta, Daniel Wong, Buye Xu

TL;DR
This paper presents a neural network model for low-latency directional speech extraction that uses learned DOA embeddings and operates at high frame rates, enabling quick adaptation to dynamic environments.
Contribution
The proposed model introduces a novel approach by training DOA embeddings from scratch and integrating them into a recurrent neural network for real-time speech extraction.
Findings
Effective extraction of speech from specified directions.
Robustness to DOA mismatch.
Quick adaptation to abrupt DOA changes.
Abstract
We introduce a novel all-neural model for low-latency directional speech extraction. The model uses direction of arrival (DOA) embeddings from a predefined spatial grid, which are transformed and fused into a recurrent-neural-network-based speech extraction model. This process enables the model to effectively extract speech from a specified DOA. Unlike previous methods that relied on hand-crafted directional features, the proposed model trains DOA embeddings from scratch using a speech enhancement loss, making it suitable for low-latency scenarios. Additionally, it operates at a high frame rate and takes in a DOA with each input frame, which lets it adapt quickly to changing scenes in highly dynamic real-world scenarios. We provide extensive evaluation to demonstrate the model's efficacy in directional speech extraction, robustness to DOA mismatch, and its capability to…
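The data flow described in the abstract (per-frame DOA index, embedding lookup, transformation, fusion into a recurrent model) can be illustrated with a small NumPy sketch. This is not the authors' architecture: the grid size, all dimensions, the multiplicative fusion, the GRU cell, the mask output, and every name below (`doa_to_index`, `init_params`, `extract`) are assumptions chosen for illustration, and the weights are random rather than trained with an enhancement loss.

```python
import numpy as np

# All sizes are hypothetical; the abstract does not state them.
N_DIRECTIONS = 72  # assumed grid: 5-degree azimuth steps
FEAT_DIM = 16      # assumed per-frame feature size
EMB_DIM = 8        # assumed DOA embedding size
HIDDEN = 32        # assumed recurrent state size


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def doa_to_index(azimuth_deg, n_directions=N_DIRECTIONS):
    """Map an azimuth in degrees to the nearest grid index, wrapping at 360."""
    step = 360.0 / n_directions
    return int(np.round((azimuth_deg % 360.0) / step)) % n_directions


def init_params(seed=0):
    """Random weights standing in for parameters that would be learned jointly."""
    rng = np.random.default_rng(seed)

    def w(*shape):
        return rng.normal(0.0, 0.1, shape)

    return {
        "emb": w(N_DIRECTIONS, EMB_DIM),   # one embedding per grid direction
        "W_fuse": w(EMB_DIM, FEAT_DIM),    # transforms an embedding into feature gains
        "W_x": w(FEAT_DIM, 3 * HIDDEN),    # GRU input weights (z, r, n gates)
        "W_h": w(HIDDEN, 3 * HIDDEN),      # GRU recurrent weights
        "b": np.zeros(3 * HIDDEN),
        "W_out": w(HIDDEN, FEAT_DIM),      # maps the state to a per-feature mask
    }


def gru_step(p, x, h):
    """One step of a standard GRU cell."""
    gx = x @ p["W_x"] + p["b"]
    gh = h @ p["W_h"]
    z = sigmoid(gx[:HIDDEN] + gh[:HIDDEN])
    r = sigmoid(gx[HIDDEN:2 * HIDDEN] + gh[HIDDEN:2 * HIDDEN])
    n = np.tanh(gx[2 * HIDDEN:] + r * gh[2 * HIDDEN:])
    return (1.0 - z) * n + z * h


def extract(p, frames, doa_indices):
    """Process frames causally; each frame carries its own DOA index."""
    h = np.zeros(HIDDEN)
    masks = []
    for x, idx in zip(frames, doa_indices):
        # Assumed fusion: the transformed embedding gates the input features.
        gain = sigmoid(p["emb"][idx] @ p["W_fuse"])
        h = gru_step(p, x * gain, h)
        masks.append(sigmoid(h @ p["W_out"]))
    return np.stack(masks)


params = init_params()
frames = np.random.default_rng(1).normal(size=(10, FEAT_DIM))
# Abrupt DOA change at frame 5: the new direction takes effect immediately.
doa = [doa_to_index(30.0)] * 5 + [doa_to_index(200.0)] * 5
masks = extract(params, frames, doa)
print(masks.shape)
```

Because the DOA index is consumed with every frame, a change in the requested direction alters the output from that frame onward without any lookahead, which is the property the abstract associates with quick adaptation.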
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
