End-to-End DOA-Guided Speech Extraction in Noisy Multi-Talker Scenarios

Kangqi Jing; Wenbin Zhang; Yu Gao

arXiv:2507.20926 · eess.AS · July 29, 2025 · Interspeech

TL;DR

This paper introduces an end-to-end target speaker extraction model that uses DOA and beamwidth embeddings to improve speech clarity in noisy multi-talker environments, enhancing ASR performance.

Contribution

The model integrates DOA and beamwidth embeddings into an end-to-end framework, so that it extracts speech from a specified spatial region and remains robust in complex multi-speaker scenarios.

Findings

01 Significant enhancement of target speech quality

02 Effective suppression of interference from other directions

03 Improved downstream ASR accuracy

Abstract

Target Speaker Extraction (TSE) plays a critical role in enhancing speech signals in noisy and multi-speaker environments. This paper presents an end-to-end TSE model that incorporates Direction of Arrival (DOA) and beamwidth embeddings to extract speech from a specified spatial region centered around the DOA. Our approach efficiently captures spatial and temporal features, enabling robust performance in highly complex scenarios with multiple simultaneous speakers. Experimental results demonstrate that the proposed model not only significantly enhances the target speech within the defined beamwidth but also effectively suppresses interference from other directions, producing a clear and isolated target voice. Furthermore, the model achieves remarkable improvements in downstream Automatic Speech Recognition (ASR) tasks, making it particularly suitable for real-world applications.
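
The "spatial region centered around the DOA" can be illustrated with a small geometric sketch. This is not the authors' model or code. It only shows one plausible reading of the region definition, under the assumptions that the DOA is a single azimuth angle in degrees and that the beamwidth is the total angular width of the region (half on each side of the DOA). The function names `angular_difference` and `in_target_region` are hypothetical.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees (0 to 180).

    Handles wrap-around, so 350 and 10 degrees are 20 degrees apart.
    """
    diff = (a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def in_target_region(source_deg, doa_deg, beamwidth_deg):
    """Return True if a source direction lies inside the target region.

    Assumption (not stated in the source): the region spans
    beamwidth_deg / 2 on each side of doa_deg.
    """
    return angular_difference(source_deg, doa_deg) <= beamwidth_deg / 2.0


if __name__ == "__main__":
    # Hypothetical scene: target region centered at 10 degrees, 40 degrees wide.
    for source in (350.0, 25.0, 90.0):
        label = "target" if in_target_region(source, 10.0, 40.0) else "interference"
        print(f"source at {source:5.1f} deg -> {label}")
```

Under this reading, a speaker inside the region is treated as the target to enhance, and a speaker outside it as interference to suppress. In the paper this decision is learned by the network from the DOA and beamwidth embeddings rather than applied as a hard geometric rule.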
