All Neural Low-latency Directional Speech Extraction

Ashutosh Pandey; Sanha Lee; Juan Azcarreta; Daniel Wong; Buye Xu

arXiv:2407.04879·cs.SD·July 9, 2024

Open Access

TL;DR

This paper presents a neural network model for low-latency directional speech extraction that uses learned DOA embeddings and operates at high frame rates, enabling quick adaptation to dynamic environments.

Contribution

The proposed model trains DOA embeddings from scratch and fuses them into a recurrent neural network for real-time speech extraction.

Findings

1. Effective extraction of speech from specified directions.

2. Robustness to DOA mismatch.

3. Quick adaptation to abrupt DOA changes.

Abstract

We introduce a novel all-neural model for low-latency directional speech extraction. The model uses direction of arrival (DOA) embeddings from a predefined spatial grid, which are transformed and fused into a speech extraction model based on a recurrent neural network. This process enables the model to extract speech from a specified DOA. Unlike previous methods that relied on hand-crafted directional features, the proposed model trains DOA embeddings from scratch using a speech enhancement loss, making it suitable for low-latency scenarios. Additionally, it operates at a high frame rate and takes in a DOA with each input frame, which lets it adapt quickly to changing scenes in highly dynamic real-world scenarios. We provide extensive evaluation to demonstrate the model's efficacy in directional speech extraction, robustness to DOA mismatch, and its capability to…
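To make the data flow in the abstract concrete, the following is a minimal NumPy sketch of a per-frame DOA embedding lookup fused into a recurrent extractor. It is not the authors' architecture: the 5-degree azimuth grid, the layer sizes, the sigmoid gating used for fusion, the plain tanh recurrent cell, and every name (`doa_table`, `doa_to_index`, `extract`) are assumptions chosen for illustration, and the weights are random rather than trained with an enhancement loss.

```python
import numpy as np

rng = np.random.default_rng(0)

GRID_STEP_DEG = 5                        # assumed grid resolution
NUM_DIRS = 360 // GRID_STEP_DEG          # 72 candidate azimuths
EMB_DIM, FEAT_DIM, HID_DIM = 16, 32, 24  # illustrative sizes only

# Parameters that would be learned from the enhancement loss; random here.
doa_table = rng.standard_normal((NUM_DIRS, EMB_DIM)) * 0.1
W_emb = rng.standard_normal((EMB_DIM, FEAT_DIM)) * 0.1
W_in = rng.standard_normal((FEAT_DIM, HID_DIM)) * 0.1
W_rec = rng.standard_normal((HID_DIM, HID_DIM)) * 0.1
W_out = rng.standard_normal((HID_DIM, FEAT_DIM)) * 0.1


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def doa_to_index(azimuth_deg):
    """Snap an azimuth in degrees to the nearest index on the grid."""
    return int(round((azimuth_deg % 360) / GRID_STEP_DEG)) % NUM_DIRS


def extract(frames, azimuths_deg):
    """Toy recurrent extractor that receives one DOA per input frame.

    frames: (T, FEAT_DIM) array of per-frame features.
    azimuths_deg: length-T sequence of target azimuths in degrees.
    Returns a (T, FEAT_DIM) array of masks with values in (0, 1).
    """
    hidden = np.zeros(HID_DIM)
    masks = []
    for frame, az in zip(frames, azimuths_deg):
        emb = doa_table[doa_to_index(az)]   # lookup on the spatial grid
        gate = sigmoid(emb @ W_emb)         # transform the embedding
        fused = frame * gate                # assumed fusion: elementwise gating
        hidden = np.tanh(fused @ W_in + hidden @ W_rec)
        masks.append(sigmoid(hidden @ W_out))
    return np.stack(masks)


frames = rng.standard_normal((10, FEAT_DIM))
azimuths = [30.0] * 5 + [120.0] * 5         # abrupt DOA change at frame 5
masks = extract(frames, azimuths)
```

Because the DOA is supplied with every frame, the output can start changing at the first frame after the target direction switches, which is the property the abstract ties to the high frame rate.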

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing