Zero-Shot TTS With Enhanced Audio Prompts: BSC Submission For The 2026 WildSpoof Challenge TTS Track

Jose Giraldo; Alex Peiró-Lilja; Rodolfo Zevallos; Cristina España-Bonet

arXiv:2602.05770 · eess.AS · February 6, 2026

Open Access

TL;DR

This paper evaluates two non-autoregressive models, StyleTTS2 and F5-TTS, for zero-shot speech synthesis, using enhanced audio prompts and a multi-stage enhancement pipeline to improve naturalness and robustness on in-the-wild speech.

Contribution

It combines non-autoregressive TTS models with multi-stage speech enhancement for zero-shot TTS, and analyzes how reference prompt quality, prompt length, and noise handling affect synthesis.

Findings

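01 Enhanced audio prompts improve zero-shot synthesis quality.

02 A multi-stage enhancement pipeline built on the Sidon model significantly outperforms standard Demucs in signal quality.

03 Finetuning on enhanced audio improves robustness, reaching up to 4.21 UTMOS and 3.47 DNSMOS.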

Abstract

We evaluate two non-autoregressive architectures, StyleTTS2 and F5-TTS, to address the spontaneous nature of in-the-wild speech. Our models use flexible duration modeling to improve prosodic naturalness. To handle acoustic noise, we implement a multi-stage enhancement pipeline using the Sidon model, which significantly outperforms standard Demucs in signal quality. Experimental results show that finetuning on enhanced audio yields superior robustness, achieving up to 4.21 UTMOS and 3.47 DNSMOS. Furthermore, we analyze the impact of reference prompt quality and length on zero-shot synthesis performance, demonstrating the effectiveness of our approach for realistic speech generation.

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Phonetics and Phonology Research