Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising

Ye-Xin Lu; Hui-Peng Du; Fei Liu; Yang Ai; Zhen-Hua Ling

arXiv:2505.13830 · eess.AS · May 23, 2025

TL;DR

This paper introduces a neural codec-based speech denoiser and integrates it with LauraTTS, improving the noise robustness of zero-shot TTS by removing noise from the acoustic prompt.

Contribution

The paper presents a novel neural codec-based denoiser that improves the noise robustness of zero-shot TTS and outperforms existing speech enhancement methods.

Findings

01 The codec denoiser outperforms state-of-the-art speech enhancement (SE) methods.

02 The noise-robust LauraTTS surpasses approaches that rely on additional SE models.

03 High-quality personalized speech synthesis is achieved even when the audio prompt is noisy.

Abstract

Large language model (LLM) based zero-shot text-to-speech (TTS) methods tend to preserve the acoustic environment of the audio prompt, leading to degradation in synthesized speech quality when the audio prompt contains noise. In this paper, we propose a novel neural codec-based speech denoiser and integrate it with the advanced LLM-based TTS model, LauraTTS, to achieve noise-robust zero-shot TTS. The proposed codec denoiser consists of an audio codec, a token denoiser, and an embedding refiner. The token denoiser predicts the first two groups of clean acoustic tokens from the noisy ones, which can serve as the acoustic prompt for LauraTTS to synthesize high-quality personalized speech or be converted to clean speech waveforms through the embedding refiner and codec decoder. Experimental results show that our proposed codec denoiser outperforms state-of-the-art speech enhancement (SE)…
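The data flow described in the abstract can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the token layout (`[group][frame]`), the function names, and the toy stand-in components are all assumptions made here so the example runs. The only detail taken from the abstract is that the token denoiser predicts the first two groups of clean acoustic tokens, which then follow one of two paths: they serve as the acoustic prompt for the TTS model, or they pass through the embedding refiner and codec decoder to become a waveform.

```python
from typing import Callable, List

Tokens = List[List[int]]   # [group][frame] codebook indices (layout is an assumption)
Embeddings = List[float]   # one toy value per frame (assumption)
Waveform = List[float]     # toy waveform, one sample per frame (assumption)

NUM_PREDICTED_GROUPS = 2   # from the abstract: the first two groups are predicted


def denoise_prompt(noisy: Tokens,
                   token_denoiser: Callable[[Tokens], Tokens]) -> Tokens:
    """Path 1: clean tokens that can serve as the acoustic prompt for the TTS model."""
    clean = token_denoiser(noisy)
    if len(clean) != NUM_PREDICTED_GROUPS:
        raise ValueError("token denoiser must return exactly two token groups")
    return clean


def tokens_to_waveform(clean: Tokens,
                       embedding_refiner: Callable[[Tokens], Embeddings],
                       codec_decoder: Callable[[Embeddings], Waveform]) -> Waveform:
    """Path 2: clean tokens -> refined embeddings -> decoded waveform."""
    return codec_decoder(embedding_refiner(clean))


# Hypothetical stand-ins so the sketch runs; none of them performs real denoising.
def toy_token_denoiser(noisy: Tokens) -> Tokens:
    # Simply copies the first two groups; a real model would predict them.
    return [list(group) for group in noisy[:NUM_PREDICTED_GROUPS]]


def toy_embedding_refiner(clean: Tokens) -> Embeddings:
    # Collapses the groups into one value per frame by summing indices.
    return [float(sum(frame)) for frame in zip(*clean)]


def toy_codec_decoder(embeddings: Embeddings) -> Waveform:
    # Scales each per-frame value; a real decoder would synthesize audio samples.
    return [value / 10.0 for value in embeddings]


noisy_tokens = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # 3 groups x 3 frames, made-up values
prompt_tokens = denoise_prompt(noisy_tokens, toy_token_denoiser)
waveform = tokens_to_waveform(prompt_tokens, toy_embedding_refiner, toy_codec_decoder)
```

Passing the components in as callables keeps the two output paths separate, so the prompt path can be used without ever running the refiner or decoder.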

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders