Low-latency Speech Enhancement via Speech Token Generation
Huaying Xue, Xiulian Peng, Yan Lu

TL;DR
This paper introduces a low-latency speech enhancement method that treats the task as a speech generation problem, using a neural codec and auto-regressive modeling to improve robustness and scalability over traditional data-driven approaches.
Contribution
It proposes a novel conditional generative framework with explicit alignment and single-stage speech code generation for enhanced noise robustness and low latency.
Findings
Outperforms data-driven methods in noise robustness
Achieves high speech quality with low latency
Demonstrates effectiveness on synthetic and real data
Abstract
Existing deep-learning-based speech enhancement methods mainly employ a data-driven approach, which leverages large amounts of data with a variety of noise types to remove noise from the noisy signal. However, this high dependence on data limits generalization to unseen, complex noises in real-life environments. In this paper, we focus on the low-latency scenario and regard speech enhancement as a speech generation problem conditioned on the noisy signal, where we generate clean speech instead of identifying and removing noise. Specifically, we propose a conditional generative framework for speech enhancement, which models clean speech with the acoustic codes of a neural speech codec and generates the speech codes auto-regressively, conditioned on past noisy frames. Moreover, we propose an explicit-alignment approach to align noisy frames with the generated speech tokens to…
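The auto-regressive, causally conditioned generation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `generate_codes` and `toy_predictor`, the one-token-per-frame pairing, and the codebook size are all assumptions made for illustration, and the toy predictor merely stands in for a trained model. The sketch only shows the control flow: at each step the predictor sees noisy frames up to the current one (no look-ahead) plus the tokens generated so far.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]
# A predictor maps (past noisy frames, past tokens) to the next code index.
Predictor = Callable[[Sequence[Frame], Sequence[int]], int]


def generate_codes(noisy_frames: Sequence[Frame], predict_next: Predictor) -> List[int]:
    """Generate one speech code per noisy frame, causally and auto-regressively.

    Assumption of this sketch: exactly one token is emitted per input frame.
    """
    tokens: List[int] = []
    for t in range(len(noisy_frames)):
        # Causal conditioning: only frames 0..t are visible at step t.
        past_frames = noisy_frames[: t + 1]
        # Auto-regression: previously generated tokens are fed back in.
        tokens.append(predict_next(past_frames, tokens))
    return tokens


def toy_predictor(past_frames: Sequence[Frame], past_tokens: Sequence[int]) -> int:
    """Hypothetical stand-in for a trained model (codebook of size 4)."""
    energy = sum(x * x for x in past_frames[-1])
    prev = past_tokens[-1] if past_tokens else 0
    return (int(energy) + prev) % 4


frames = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
codes = generate_codes(frames, toy_predictor)
```

Because each step depends only on past and current input, the tokens produced for the first frames do not change when later frames arrive, which is the property a low-latency system relies on.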
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation
