Low-latency Speech Enhancement via Speech Token Generation

Huaying Xue; Xiulian Peng; Yan Lu

arXiv:2310.08981 · cs.SD · January 24, 2024 · 1 citation

Open Access

TL;DR

This paper introduces a low-latency speech enhancement method that treats the task as a speech generation problem, using a neural codec and auto-regressive modeling to improve robustness and scalability over traditional data-driven approaches.

Contribution

It proposes a novel conditional generative framework with explicit alignment and single-stage speech code generation for enhanced noise robustness and low latency.

Findings

01 Outperforms data-driven methods in noise robustness
02 Achieves high speech quality with low latency
03 Demonstrates effectiveness on synthetic and real data

Abstract

Existing deep learning based speech enhancement methods mainly employ a data-driven approach, which leverages large amounts of data with a variety of noise types to remove noise from noisy signals. However, the high dependence on data limits generalization to unseen complex noises in real-life environments. In this paper, we focus on the low-latency scenario and regard speech enhancement as a speech generation problem conditioned on the noisy signal, where we generate clean speech instead of identifying and removing noises. Specifically, we propose a conditional generative framework for speech enhancement, which models clean speech by acoustic codes of a neural speech codec and generates the speech codes conditioned on past noisy frames in an auto-regressive way. Moreover, we propose an explicit-alignment approach to align noisy frames with the generated speech tokens to…
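The auto-regressive generation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the greedy decoding rule, the single 4-entry codebook, and the one-token-per-frame pairing are all assumptions made for illustration, and `toy_scorer` is a hypothetical stand-in for a trained model. The only point it shows is causality: the token emitted at a given frame depends on the noisy frames seen so far and on previously generated tokens, never on future input.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # one noisy feature frame (hypothetical representation)
Scorer = Callable[[List[Frame], List[int]], List[float]]


def generate_tokens(noisy_frames: Sequence[Frame], score_next: Scorer) -> List[int]:
    """Emit one codec token per incoming noisy frame, causally and greedily."""
    tokens: List[int] = []
    seen: List[Frame] = []
    for frame in noisy_frames:
        seen.append(frame)                 # only frames up to "now" are visible
        scores = score_next(seen, tokens)  # one score per codebook entry
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)                # feeds back into the next step
    return tokens


def toy_scorer(seen: List[Frame], tokens: List[int]) -> List[float]:
    """Hypothetical stand-in for a trained model (assumed 4-entry codebook)."""
    energy = sum(x * x for x in seen[-1])   # depends on the current noisy frame
    prev = tokens[-1] if tokens else 0      # depends on the previous token
    target = (int(energy) + prev) % 4
    return [1.0 if k == target else 0.0 for k in range(4)]


frames = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
codes = generate_tokens(frames, toy_scorer)
print(codes)
```

Because each step only reads `seen` and `tokens`, running the loop on a prefix of the input yields the same prefix of the output, which is the property a low-latency, frame-by-frame system needs. In a real system the generated codes would then be passed to the codec decoder to synthesize the waveform.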

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation