TL;DR
RobustSpeechFlow is a training strategy for flow-matching text-to-speech models that extends contrastive flow matching with length-preserving repeat and skip latent augmentations, improving alignment robustness and intelligibility across diverse speaker and prosody conditions without external aligners or preference data.
Contribution
It proposes an augmentation-based contrastive flow matching method that improves TTS alignment robustness and content fidelity without external aligners or preference data.
Findings
Reduces word error rate (WER) from 1.44 to 1.38 on Seed-TTS-eval using only 0.06B parameters.
Reduces English character error rate (CER) from 0.48% to 0.35% on ZERO500 at NFE=24.
Reduces Korean CER from 0.81% to 0.57% on ZERO500 at NFE=24.
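To put the three reported reductions on a common scale, the relative change can be worked out directly from the numbers above. This is a small arithmetic sketch; the helper name `relative_reduction` and the dictionary layout are illustrative and not taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction of an error rate, in percent of the baseline value."""
    return (before - after) / before * 100.0


# (baseline, RobustSpeechFlow) pairs as reported in the findings above.
reported = {
    "Seed-TTS-eval WER": (1.44, 1.38),
    "ZERO500 English CER": (0.48, 0.35),
    "ZERO500 Korean CER": (0.81, 0.57),
}

for name, (before, after) in reported.items():
    print(f"{name}: {relative_reduction(before, after):.1f}% relative reduction")
```

On these figures the relative reductions are roughly 4.2% (Seed-TTS-eval WER), 27.1% (English CER), and 29.6% (Korean CER).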
Abstract
While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%.…
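The abstract does not spell out how the augmentations or the loss are implemented, so the following is only a rough NumPy sketch of one plausible reading. It assumes a latent of shape (T, D), builds "repeat" and "skip" negatives that keep T fixed, and plugs them into a generic contrastive flow matching objective (match the true velocity, push away from the negative's velocity). The function names, the padding choice in `skip_augment`, the reuse of the same noise for the negative, and the weight `lam` are all assumptions made for illustration, not the paper's definitions.

```python
import numpy as np


def repeat_augment(latent: np.ndarray, start: int, length: int) -> np.ndarray:
    """Replay frames [start, start+length) once, then drop trailing frames so T is unchanged."""
    T = latent.shape[0]
    assert length > 0 and 0 <= start and start + 2 * length <= T
    idx = np.concatenate([np.arange(0, start + length), np.arange(start, T - length)])
    return latent[idx]


def skip_augment(latent: np.ndarray, start: int, length: int) -> np.ndarray:
    """Drop frames [start, start+length), then pad with the last frame so T is unchanged.

    Padding with the final frame is an assumption; other length-preserving fills are possible.
    """
    T = latent.shape[0]
    assert length > 0 and 0 <= start and start + length < T
    idx = np.concatenate([
        np.arange(0, start),
        np.arange(start + length, T),
        np.full(length, T - 1),
    ])
    return latent[idx]


def contrastive_fm_loss(v_pred, x1, x1_neg, x0, lam=0.05):
    """Generic contrastive flow matching form (assumed): attract to the clean
    target velocity x1 - x0, repel from the augmented negative's velocity."""
    pos = np.mean((v_pred - (x1 - x0)) ** 2)
    neg = np.mean((v_pred - (x1_neg - x0)) ** 2)
    return pos - lam * neg


# Toy latent with 8 frames of dimension 1, so the frame order is easy to read.
latent = np.arange(8, dtype=float).reshape(8, 1)
repeated = repeat_augment(latent, start=2, length=2)  # frames 0,1,2,3,2,3,4,5
skipped = skip_augment(latent, start=2, length=2)     # frames 0,1,4,5,6,7,7,7

noise = np.zeros_like(latent)   # stand-in for the sampled noise x0
perfect_v = latent - noise      # a prediction that exactly matches the clean velocity
loss = contrastive_fm_loss(perfect_v, latent, repeated, noise)
```

In this toy setup a prediction that matches the clean velocity gets a negative loss, because the repel term rewards distance from the repeated negative. A prediction that drifted toward the repeated or skipped sequence would be penalized instead, which is the intuition behind "directly penalizes realistic failure modes".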
