FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

Hui Wang; Shujie Liu; Lingwei Meng; Jinyu Li; Yifan Yang; Shiwan Zhao; Haiyang Sun; Yanqing Liu; Haoqin Sun; Jiaming Zhou; Yan Lu; Yong Qin

arXiv:2502.11128 · cs.CL · September 4, 2025

TL;DR

FELLE is a novel autoregressive speech synthesis model that combines language modeling with token-wise flow matching and a hierarchical coarse-to-fine approach to improve the quality and coherence of generated mel-spectrograms.

Contribution

The paper introduces an autoregressive framework that integrates flow matching with hierarchical coarse-to-fine generation for speech synthesis.

Findings

01. Significant improvements in TTS quality demonstrated.

02. Effective modeling of continuous-valued tokens with flow matching.

03. Enhanced temporal coherence and stability in synthesis.

Abstract

To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to…
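
The mechanism described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation, only one possible reading under stated assumptions: the flow-matching prior for each frame is taken to be a Gaussian centered on the previous frame, the coarse stage is taken to be a half-resolution frame obtained by averaging adjacent mel bins, and the fine stage is taken to be the residual on top of the upsampled coarse frame. All function names (`sample_prior`, `euler_flow`, `toy_velocity`, `pool`, `unpool`, `generate`) are hypothetical, and `toy_velocity` is a stand-in for a trained network: it treats its conditioning vector as the target and returns the exact straight-line velocity toward it.

```python
import numpy as np

def sample_prior(prev_component, sigma, rng):
    """Draw x0 ~ N(prev_component, sigma^2 I) instead of the usual N(0, I).
    Centering on the previous step is an assumed form of the modified prior."""
    return prev_component + sigma * rng.standard_normal(prev_component.shape)

def euler_flow(x0, velocity_fn, cond, n_steps=8):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt, cond)
    return x

def toy_velocity(x, t, cond):
    """Stand-in for a learned velocity network (assumption): the straight-line
    field that carries x to `cond` at t=1. t never reaches 1 inside euler_flow."""
    return (cond - x) / (1.0 - t)

def pool(x):
    """Assumed coarse representation: average adjacent pairs of mel bins."""
    return x.reshape(-1, 2).mean(axis=1)

def unpool(x):
    """Upsample a coarse frame back to full resolution by repetition."""
    return np.repeat(x, 2)

def generate(lm_outputs, sigma=0.5, seed=0):
    """Autoregressive loop: one coarse flow, then one fine (residual) flow per frame.
    `lm_outputs` stands in for per-step language-model conditioning; in this toy
    setup each row doubles as the target frame. The number of bins must be even."""
    rng = np.random.default_rng(seed)
    prev = np.zeros_like(lm_outputs[0])
    frames = []
    for z in lm_outputs:
        # Coarse stage: prior centered on the pooled previous frame.
        prev_coarse = pool(prev)
        coarse = euler_flow(sample_prior(prev_coarse, sigma, rng), toy_velocity, pool(z))
        # Fine stage: residual detail, given the coarse result.
        prev_resid = prev - unpool(prev_coarse)
        resid = euler_flow(sample_prior(prev_resid, sigma, rng), toy_velocity,
                           z - unpool(coarse))
        prev = unpool(coarse) + resid
        frames.append(prev)
    return np.stack(frames)

# Usage: five frames of an 8-bin toy "mel-spectrogram".
targets = np.random.default_rng(1).standard_normal((5, 8))
mel = generate(targets)
```

Because the toy velocity field is exact, the generated frames match the conditioning rows up to floating-point error. With a trained network the prior centered on the previous step would instead shorten the path the flow has to cover between neighboring frames, which is one plausible source of the coherence and stability the abstract mentions.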

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling