Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS
Haoyu Li, Mingyang Han, Yu Xi, Dongxiao Wang, Hankun Wang, Haoxiang Shi, Boyu Li, Jun Song, Bo Zheng, Shuai Wang, Kai Yu

TL;DR
This paper introduces TLA-SA, a novel adaptive alignment method that improves speaker similarity in flow-matching zero-shot TTS by addressing non-uniform speaker information distribution across time and layers.
Contribution
The paper presents TLA-SA, a new adaptive alignment strategy that enhances speaker consistency in FM-based zero-shot TTS systems, addressing the underexplored speaker representation issue.
Findings
TLA-SA significantly improves speaker similarity in TTS systems.
The method generalizes well across different model architectures.
Experimental results show robustness on large-scale datasets.
Abstract
Flow-Matching (FM)-based zero-shot text-to-speech (TTS) systems exhibit high-quality speech synthesis and robust generalization capabilities. However, the speaker representation ability of such systems remains underexplored, primarily due to the lack of explicit speaker-specific supervision in the FM framework. To this end, we conduct an empirical analysis of speaker information distribution and reveal its non-uniform allocation across time steps and network layers, underscoring the need for adaptive speaker alignment. Accordingly, we propose Time-Layer Adaptive Speaker Alignment (TLA-SA), a strategy that enhances speaker consistency by jointly leveraging temporal and hierarchical variations. Experimental results show that TLA-SA substantially improves speaker similarity over baseline systems on both research- and industrial-scale datasets and generalizes well across diverse model…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing
