Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS

Haoyu Li; Mingyang Han; Yu Xi; Dongxiao Wang; Hankun Wang; Haoxiang Shi; Boyu Li; Jun Song; Bo Zheng; Shuai Wang; Kai Yu

arXiv:2511.09995·eess.AS·March 18, 2026

TL;DR

This paper introduces Time-Layer Adaptive Speaker Alignment (TLA-SA), an adaptive alignment method that improves speaker similarity in flow-matching (FM) zero-shot TTS by accounting for the non-uniform distribution of speaker information across time steps and network layers.

Contribution

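The paper presents TLA-SA, an adaptive alignment strategy that improves speaker consistency in flow-matching (FM) based zero-shot TTS systems. It targets the speaker representation ability of these systems, which remains underexplored because the FM framework lacks explicit speaker-specific supervision.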

Findings

01. TLA-SA substantially improves speaker similarity over baseline systems.

02. The method generalizes well across different model architectures.

03. The improvement holds on both research-scale and industrial-scale datasets.

Abstract

Flow-Matching (FM)-based zero-shot text-to-speech (TTS) systems exhibit high-quality speech synthesis and robust generalization capabilities. However, the speaker representation ability of such systems remains underexplored, primarily due to the lack of explicit speaker-specific supervision in the FM framework. To this end, we conduct an empirical analysis of speaker information distribution and reveal its non-uniform allocation across time steps and network layers, underscoring the need for adaptive speaker alignment. Accordingly, we propose Time-Layer Adaptive Speaker Alignment (TLA-SA), a strategy that enhances speaker consistency by jointly leveraging temporal and hierarchical variations. Experimental results show that TLA-SA substantially improves speaker similarity over baseline systems on both research- and industrial-scale datasets and generalizes well across diverse model…
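The abstract does not spell out the TLA-SA objective, so the following is only an illustrative sketch of what a time- and layer-dependent speaker alignment term could look like. It assumes, hypothetically, that hidden states from selected layers are mean-pooled and compared to a reference speaker embedding by cosine similarity, and that a weight depending on the flow-matching time step `t` and the layer index scales each term. The names `weighted_alignment_loss`, `weight_fn`, and `toy_weight`, as well as the weighting schedule itself, are assumptions for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(frames):
    """Average a list of frame vectors into one utterance-level vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def weighted_alignment_loss(hidden_by_layer, speaker_emb, t, weight_fn):
    """Hypothetical alignment term: sum over layers of w(t, layer) * (1 - cos).

    hidden_by_layer maps a layer index to a list of frame vectors.
    Assumption: hidden states already share the speaker embedding's
    dimensionality (a real system would likely need a projection).
    """
    total = 0.0
    for layer, frames in hidden_by_layer.items():
        pooled = mean_pool(frames)
        similarity = cosine_similarity(pooled, speaker_emb)
        total += weight_fn(t, layer) * (1.0 - similarity)
    return total


def toy_weight(t, layer):
    """Assumed toy schedule (not from the paper): only layer 2, scaled by t."""
    return t if layer == 2 else 0.0


if __name__ == "__main__":
    hidden = {
        1: [[1.0, 0.0], [1.0, 0.0]],  # layer 1 frames
        2: [[0.0, 1.0], [0.0, 1.0]],  # layer 2 frames
    }
    speaker = [1.0, 0.0]
    print(weighted_alignment_loss(hidden, speaker, t=0.5, weight_fn=toy_weight))
```

In this toy setup the weight function is the only place where time and layer dependence enters, which keeps the idea of adapting alignment strength separate from the similarity measure itself. How the actual method chooses its weights, layers, and speaker encoder is not stated in the text above.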

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing