Zero-Shot TTS With Enhanced Audio Prompts: BSC Submission For The 2026 WildSpoof Challenge TTS Track
Jose Giraldo, Alex Peiró-Lilja, Rodolfo Zevallos, Cristina España-Bonet

TL;DR
This paper explores non-autoregressive models for zero-shot speech synthesis, combining enhanced audio prompts with multi-stage noise reduction to improve naturalness and robustness on in-the-wild speech.
Contribution
It introduces a novel combination of models and enhancement techniques for zero-shot TTS, emphasizing the impact of prompt quality and noise handling.
Findings
Enhanced audio prompts improve zero-shot synthesis quality.
Multi-stage noise reduction significantly boosts signal clarity.
Fine-tuning on enhanced audio improves robustness and naturalness.
Abstract
We evaluate two non-autoregressive architectures, StyleTTS2 and F5-TTS, to address the spontaneous nature of in-the-wild speech. Our models use flexible duration modeling to improve prosodic naturalness. To handle acoustic noise, we implement a multi-stage enhancement pipeline using the Sidon model, which significantly outperforms standard Demucs in signal quality. Experimental results show that fine-tuning on enhanced audio yields superior robustness, achieving up to 4.21 UTMOS and 3.47 DNSMOS. Furthermore, we analyze the impact of reference prompt quality and length on zero-shot synthesis performance, demonstrating the effectiveness of our approach for realistic speech generation.
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Phonetics and Phonology Research
