RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

Jinhyeok Yang; Hyeongju Kim; Yechan Yu; Joon Byun; Frederik Bous; Juheon Lee

arXiv:2605.22083·cs.SD·May 22, 2026

TL;DR

RobustSpeechFlow enhances flow-matching text-to-speech models by introducing augmentation-based contrastive training, significantly improving alignment robustness, speech fidelity, and cross-speaker naturalness without external aligners.

Contribution

It proposes an augmentation-based contrastive flow matching method that improves TTS alignment robustness and speech quality without external aligners or preference data.

Findings

01. Reduces word error rate (WER) from 1.44 to 1.38 on Seed-TTS-eval, using only 0.06B parameters.

02. Decreases English character error rate (CER) from 0.48% to 0.35% on ZERO500 at NFE=24.

03. Improves Korean CER from 0.81% to 0.57% on ZERO500 at NFE=24.
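
As a quick sanity check on the three findings, the relative error reductions can be computed directly from the reported before/after values. The helper name `relative_reduction` is ours and purely illustrative; the resulting percentages are derived arithmetic, not figures reported by the authors.

```python
def relative_reduction(before, after):
    """Percent drop from `before` to `after`, rounded to one decimal."""
    return round(100.0 * (before - after) / before, 1)


# (before, after) pairs copied from the findings above.
reported = {
    "Seed-TTS-eval WER": (1.44, 1.38),
    "ZERO500 English CER (%)": (0.48, 0.35),
    "ZERO500 Korean CER (%)": (0.81, 0.57),
}

reductions = {name: relative_reduction(b, a) for name, (b, a) in reported.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct}% relative reduction")
```

On these numbers the relative gains work out to roughly 4.2% (Seed-TTS-eval WER), 27.1% (English CER), and 29.6% (Korean CER), so the ZERO500 improvements are proportionally much larger than the Seed-TTS-eval one.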

Abstract

While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%.…
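
To make the idea in the abstract concrete, here is a minimal NumPy sketch of what length-preserving repeat and skip augmentations on a latent sequence, and a contrastive flow matching objective that uses them as negatives, might look like. Everything here is an assumption for illustration, not the authors' implementation: the function names, the choice to truncate after a repeat, the choice to pad with the last frame after a skip, the linear path with target velocity `x1 - x0`, the reuse of the same noise for the negative, and the weight `lam` are all ours.

```python
import numpy as np


def repeat_indices(num_frames, start, length):
    """Indices that replay frames [start, start+length) once, truncated to num_frames."""
    head = np.arange(0, start + length)
    replay = np.arange(start, start + length)
    tail = np.arange(start + length, num_frames)
    return np.concatenate([head, replay, tail])[:num_frames]


def skip_indices(num_frames, start, length):
    """Indices that drop frames [start, start+length), padded with the last frame (assumption)."""
    kept = np.concatenate([np.arange(0, start), np.arange(start + length, num_frames)])
    pad = np.full(num_frames - kept.size, num_frames - 1)
    return np.concatenate([kept, pad])


def contrastive_fm_loss(v_pred, x0, x1, x1_neg, lam=0.05):
    """Pull the predicted velocity toward the clean target, push it from the corrupted one.

    Assumes a linear path x_t = (1 - t) * x0 + t * x1, so the target velocity is x1 - x0.
    The value of `lam` is a placeholder, not a setting taken from the paper.
    """
    pos = np.mean((v_pred - (x1 - x0)) ** 2)
    neg = np.mean((v_pred - (x1_neg - x0)) ** 2)
    return pos - lam * neg


rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 4))  # toy latent: 8 frames, 4 channels
noise = rng.standard_normal((8, 4))   # x0 sample

repeated = latent[repeat_indices(8, start=2, length=2)]  # frames 0,1,2,3,2,3,4,5
skipped = latent[skip_indices(8, start=2, length=2)]     # frames 0,1,4,5,6,7,7,7

# A velocity that matches the clean target scores lower than one matching the repeat negative.
loss_clean = contrastive_fm_loss(latent - noise, noise, latent, repeated)
loss_repeat = contrastive_fm_loss(repeated - noise, noise, latent, repeated)
```

Because both augmentations keep the sequence at its original length, the negative has the same shape as the clean latent and can be dropped into the loss without any external aligner, which is the property the abstract highlights.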

Code & Models

Repositories

https://robustspeechflow.github.io
