Lightweight speech enhancement guided target speech extraction in noisy multi-speaker scenarios
Ziling Huang, Junnan Wu, Lichun Fan, Zhenbo Luo, Jian Luan, Haixin Guan, Yanhua Long

TL;DR
This paper uses a lightweight speech enhancement model, GTCRN, to guide target speech extraction (TSE) in noisy multi-speaker scenarios. It proposes two extensions of the SEF-PNet framework, LGTSE and D-LGTSE, together with a two-stage training strategy.
Contribution
The paper proposes two extensions of the SEF-PNet framework, LGTSE and D-LGTSE, that make TSE more robust in noisy environments, along with a two-stage training strategy.
Findings
Achieved a 0.89 dB SI-SDR improvement on Libri2Mix
Improved PESQ by 0.16 and STOI by 1.97%
Validated effectiveness in noisy multi-speaker scenarios
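For readers unfamiliar with the first metric above, the following is a minimal sketch of how scale-invariant signal-to-distortion ratio (SI-SDR) is commonly computed. It follows the usual textbook definition and is not taken from the paper. The function name si_sdr, the eps parameter, and the mean-removal step are assumptions made for illustration.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Illustrative helper; the name and eps default are assumptions.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")

    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target to get the optimal scaling.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t)
    alpha = dot / (target_energy + eps)

    # Split the estimate into a scaled-target part and a residual part.
    scaled = [alpha * b for b in t]
    residual = [a - s for a, s in zip(e, scaled)]

    signal_power = sum(s * s for s in scaled)
    residual_power = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_power + eps) / (residual_power + eps))
```

An improvement such as the one reported above would be the difference between two such scores, for example si_sdr(new_output, target) - si_sdr(baseline_output, target), averaged over a test set.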
Abstract
Target speech extraction (TSE) has achieved strong performance in relatively simple conditions such as one-speaker-plus-noise and two-speaker mixtures, but its performance remains unsatisfactory in noisy multi-speaker scenarios. To address this issue, we introduce a lightweight speech enhancement model, GTCRN, to better guide TSE in noisy environments. Building on our competitive previous speaker embedding/encoder-free framework SEF-PNet, we propose two extensions: LGTSE and D-LGTSE. LGTSE incorporates noise-agnostic enrollment guidance by denoising the input noisy speech before context interaction with enrollment speech, thereby reducing noise interference. D-LGTSE further improves system robustness against speech distortion by leveraging denoised speech as an additional noisy input during training, expanding the dynamic range of noisy conditions and enabling the model to directly…
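The D-LGTSE idea described in the abstract, treating denoised speech as an additional input during training, can be pictured with a toy data-flow sketch. Everything below is an assumption for illustration: placeholder_denoise is a simple moving average standing in for a real enhancement model (it is not GTCRN), build_training_inputs is a hypothetical helper, and pairing both inputs with the same target is one plausible reading of the abstract, not the authors' confirmed recipe.

```python
def placeholder_denoise(samples, width=3):
    """Hypothetical stand-in for a speech enhancement model.

    A centered moving average, used only so the sketch runs.
    """
    half = width // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        window = samples[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def build_training_inputs(mixture, target):
    """Return (input, target) pairs for one training example.

    Assumption: the original noisy mixture and its denoised version
    are both used as inputs, each paired with the same target speech.
    """
    denoised = placeholder_denoise(mixture)
    return [(mixture, target), (denoised, target)]
```

Under this reading, each noisy mixture yields two training inputs, so the model sees both heavily noisy and partially cleaned signals, which matches the abstract's description of expanding the range of noisy conditions.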
