End-to-End DOA-Guided Speech Extraction in Noisy Multi-Talker Scenarios
Kangqi Jing, Wenbin Zhang, Yu Gao

TL;DR
This paper introduces an end-to-end target speaker extraction (TSE) model that uses Direction of Arrival (DOA) and beamwidth embeddings to isolate the target speaker in noisy multi-talker environments, which in turn improves downstream ASR performance.
Contribution
The model integrates DOA and beamwidth embeddings into an end-to-end framework for robust speech extraction in complex multi-speaker scenarios.
Findings
Significant enhancement of target speech quality
Effective suppression of interference from other directions
Improved downstream ASR accuracy
Abstract
Target Speaker Extraction (TSE) plays a critical role in enhancing speech signals in noisy and multi-speaker environments. This paper presents an end-to-end TSE model that incorporates Direction of Arrival (DOA) and beamwidth embeddings to extract speech from a specified spatial region centered around the DOA. Our approach efficiently captures spatial and temporal features, enabling robust performance in highly complex scenarios with multiple simultaneous speakers. Experimental results demonstrate that the proposed model not only significantly enhances the target speech within the defined beamwidth but also effectively suppresses interference from other directions, producing a clear and isolated target voice. Furthermore, the model achieves remarkable improvements in downstream Automatic Speech Recognition (ASR) tasks, making it particularly suitable for real-world applications.
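To make the idea of "a specified spatial region centered around the DOA" concrete, here is a minimal sketch of how such a region could be defined and how a DOA and beamwidth pair might be turned into a conditioning vector. This is an illustration under stated assumptions, not the authors' implementation: it assumes azimuth-only angles in degrees, treats the beamwidth as the full width of the region, and uses a simple sine/cosine encoding. The function names `angular_distance_deg`, `in_target_region`, and `cyclic_embedding` are hypothetical.

```python
import math


def angular_distance_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two azimuths, in degrees (0 to 180)."""
    diff = (a - b) % 360.0
    return min(diff, 360.0 - diff)


def in_target_region(source_deg: float, doa_deg: float, beamwidth_deg: float) -> bool:
    """True if a source lies inside the region centered on the DOA.

    Assumption: beamwidth_deg is the full angular width, so the region
    spans doa_deg +/- beamwidth_deg / 2, with wrap-around at 360 degrees.
    """
    return angular_distance_deg(source_deg, doa_deg) <= beamwidth_deg / 2.0


def cyclic_embedding(doa_deg: float, beamwidth_deg: float) -> list[float]:
    """Hypothetical conditioning vector: [sin(doa), cos(doa), normalized width].

    The sine/cosine pair keeps 0 and 360 degrees close together, which a
    raw angle value would not. A learned embedding could replace this.
    """
    theta = math.radians(doa_deg)
    return [math.sin(theta), math.cos(theta), beamwidth_deg / 360.0]


if __name__ == "__main__":
    # A talker at 355 degrees falls inside a 30-degree region centered on 5 degrees,
    # while a talker at 90 degrees counts as interference.
    print(in_target_region(355.0, 5.0, 30.0))
    print(in_target_region(90.0, 5.0, 30.0))
    print(cyclic_embedding(5.0, 30.0))
```

In this sketch, sources inside the region are the ones the model is meant to keep, and everything else is interference to suppress. How the paper actually encodes and injects the two embeddings is not specified in the abstract.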
