ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation

Suhua Wang; Zifan Wang; Xiaoxin Sun; D. J. Wang; Zhanbo Liu; Xin Li

arXiv:2512.22491 · cs.CL · December 30, 2025

TL;DR

ManchuTTS introduces a hierarchical, flow-based speech synthesis model tailored for the endangered Manchu language, effectively addressing data scarcity and linguistic complexity to produce high-quality speech.

Contribution

This work presents the first Manchu TTS dataset, a hierarchical text representation, and a novel flow-matching Transformer model with hierarchical contrastive loss for agglutinative language synthesis.
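
The page does not give the form of the hierarchical contrastive loss. As a minimal sketch only, one common way to build such a loss is to compute an InfoNCE-style term at each text tier (phoneme, syllable, prosodic) and sum the terms with weights. The function names, the temperature, and the tier weights below are all hypothetical and are not taken from the paper.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(text_vecs, audio_vecs, temperature=0.1):
    """Mean InfoNCE loss where text_vecs[i] is the positive pair of audio_vecs[i].

    Assumption: this is a generic contrastive term, not the paper's exact loss.
    """
    total = 0.0
    for i, text in enumerate(text_vecs):
        logits = [dot(text, audio) / temperature for audio in audio_vecs]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(text_vecs)

def hierarchical_contrastive_loss(levels, weights):
    """Weighted sum of per-tier InfoNCE terms.

    levels maps a tier name to a (text_vecs, audio_vecs) pair.
    weights maps the same tier names to hypothetical scalar weights.
    """
    return sum(weights[name] * info_nce(text, audio)
               for name, (text, audio) in levels.items())

# Toy embeddings: two units per tier, each text vector aligned with one audio vector.
aligned = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
levels = {"phoneme": aligned, "syllable": aligned, "prosodic": aligned}
weights = {"phoneme": 1.0, "syllable": 0.5, "prosodic": 0.25}  # illustrative values
loss = hierarchical_contrastive_loss(levels, weights)
```

In this toy setup, swapping the audio vectors so that each text vector faces the wrong partner raises the loss, which is the behaviour a contrastive term is meant to have.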

Findings

01. Achieved a MOS of 4.52 with limited training data

02. Hierarchical guidance improves pronunciation accuracy by 31%

03. Prosodic naturalness increased by 27%

Abstract

As an endangered language, Manchu presents unique challenges for speech synthesis, including severe data scarcity and strong phonological agglutination. This paper proposes ManchuTTS (Manchu Text to Speech), a novel approach tailored to Manchu's linguistic characteristics. To handle agglutination, the method uses a three-tier text representation (phoneme, syllable, prosodic) and a cross-modal hierarchical attention mechanism for multi-granular alignment. The synthesis model integrates deep convolutional networks with a flow-matching Transformer, enabling efficient, non-autoregressive generation. The method further introduces a hierarchical contrastive loss to guide structured acoustic-linguistic correspondence. To address low-resource constraints, the authors construct the first Manchu TTS dataset and employ a data augmentation strategy. Experiments demonstrate that ManchuTTS attains…
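
The abstract names flow matching but gives no formulation. The sketch below illustrates the generic idea with the common straight-line (linear interpolation) path between a noise sample and a data sample; it is an assumption that ManchuTTS uses this particular path. All function names are hypothetical, the short lists stand in for acoustic frames, and an oracle velocity replaces the Transformer so the example stays self-contained.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.

    Every element of x is updated at once in each step, which is what makes
    this style of generation non-autoregressive.
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy values: x0 plays the role of noise, x1 the role of a target acoustic frame.
x0 = [0.5, -1.0, 2.0]
x1 = [1.0, 0.0, -1.0]

# Training side: a model would be asked to predict target_velocity at interpolate(x0, x1, t).
x_mid = interpolate(x0, x1, 0.5)
loss_for_zero_model = flow_matching_loss([0.0, 0.0, 0.0], target_velocity(x0, x1))

# Sampling side: with the oracle velocity, Euler integration carries x0 to x1.
generated = euler_sample(lambda x, t: target_velocity(x0, x1), x0)
```

In a real system the oracle would be replaced by a trained network conditioned on the text representation, and the number of Euler steps would trade synthesis speed against quality.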

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Voice and Speech Disorders