Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track
June Young Yi, Hyeongju Kim, Juheon Lee

TL;DR
This paper introduces a robust TTS training method that uses Self-Purifying Flow Matching (SPFM) to adapt an open-weight TTS model to in-the-wild speech, achieving the lowest Word Error Rate and the second-best perceptual scores in the WildSpoof Challenge TTS Track.
Contribution
The paper proposes a novel fine-tuning approach with SPFM that improves robustness of open-weight TTS models to real-world noisy speech conditions.
Findings
Achieved lowest Word Error Rate among competitors.
Ranked second in perceptual quality metrics.
Demonstrated effective adaptation of open-weight models to diverse speech data.
Abstract
This paper presents a lightweight text-to-speech (TTS) system developed for the WildSpoof Challenge TTS Track. Our approach fine-tunes the recently released open-weight TTS model, Supertonic (https://github.com/supertone-inc/supertonic), with Self-Purifying Flow Matching (SPFM) to enable robust adaptation to in-the-wild speech. SPFM mitigates label noise by comparing conditional and unconditional flow matching losses on each sample, routing suspicious text-speech pairs to unconditional training while still leveraging their acoustic information. The resulting model achieves the lowest Word Error Rate (WER) among all participating teams, while ranking second in perceptual metrics such as UTMOS and DNSMOS. These findings demonstrate that efficient, open-weight architectures like Supertonic can be effectively adapted to diverse real-world speech conditions when…
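The per-sample routing described in the abstract can be illustrated with a minimal sketch. It assumes the conditional and unconditional flow matching losses for each sample have already been computed (ideally with the same noise and time step), and it assumes a sample counts as suspicious when its conditional loss exceeds its unconditional loss by more than a margin. The function names `spfm_route` and `spfm_batch_loss` and the `margin` parameter are hypothetical; the abstract does not state the exact decision rule, so this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def spfm_route(cond_losses: Sequence[float],
               uncond_losses: Sequence[float],
               margin: float = 0.0) -> List[bool]:
    """Return True for samples whose text condition is kept.

    Assumed rule (not confirmed by the source): if conditioning on the
    text does not lower the loss relative to the unconditional loss,
    the text-speech pair is treated as suspicious.
    """
    if len(cond_losses) != len(uncond_losses):
        raise ValueError("loss sequences must have the same length")
    return [c <= u + margin for c, u in zip(cond_losses, uncond_losses)]


def spfm_batch_loss(cond_losses: Sequence[float],
                    uncond_losses: Sequence[float],
                    margin: float = 0.0) -> Tuple[float, List[bool]]:
    """Average the routed per-sample losses over a batch.

    Trusted samples contribute their conditional loss; suspicious
    samples contribute their unconditional loss, so their audio is
    still used for training while the doubtful text is ignored.
    """
    if len(cond_losses) == 0:
        raise ValueError("batch must contain at least one sample")
    keep = spfm_route(cond_losses, uncond_losses, margin)
    routed = [c if k else u
              for c, u, k in zip(cond_losses, uncond_losses, keep)]
    return sum(routed) / len(routed), keep


if __name__ == "__main__":
    # Illustrative numbers only; the second pair looks mislabeled.
    loss, keep = spfm_batch_loss([0.2, 0.9, 0.5], [0.6, 0.4, 0.5])
    print(keep, loss)
```

In an actual training loop the two loss lists would come from two forward passes of the model, one with the text condition and one with it dropped, and only the routed loss would be backpropagated.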
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
