MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement

Xinyue Yu; Youqing Fang; Pingyu Wu; Guoyang Ye; Wenbo Zhou; Weiming Zhang; Song Xiao

arXiv:2511.12074 · cs.SD · November 20, 2025

TL;DR

MF-Speech is a framework that disentangles speech into content, timbre, and emotion factors and enables fine-grained, compositional control over them in speech generation, improving quality and controllability over previous methods.

Contribution

The paper proposes MF-Speech, a new framework with factor purification and hierarchical control, advancing speech generation by enabling precise, multi-factor manipulation and transferability of speech representations.

Findings

01. Outperforms state-of-the-art methods in multi-factor speech generation

02. Achieves a lower word error rate (WER = 4.67%) and higher subjective scores than the compared methods

03. Demonstrates strong transferability of learned speech factors
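
For readers unfamiliar with the WER metric cited in the findings, the sketch below shows the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is a generic illustration, not the paper's evaluation pipeline; the function name `word_error_rate` and the whitespace tokenization are assumptions made here, and the listing does not say which ASR model or text normalization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference length.

    Hypothetical helper for illustration only; tokenization is plain
    whitespace splitting, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 4.67% corresponds to roughly one word error per 21 reference words, assuming the figure is computed over the whole test set.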

Abstract

Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but its progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we propose a novel framework called MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. Subsequently, MF-SpeechGenerator functions as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments…
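
The abstract names Hierarchical Style Adaptive Normalization (HSAN) but does not define it in the portion shown here. As background only, the sketch below illustrates the generic idea behind style-adaptive normalization layers (in the spirit of adaptive instance normalization): each feature channel is normalized to zero mean and unit variance, then rescaled and shifted with per-channel parameters derived from a style signal. This is not the authors' HSAN; the function name `adaptive_instance_norm` and the `gamma`/`beta` inputs are hypothetical, and how MF-Speech predicts or stacks such parameters is not stated.

```python
from statistics import fmean, pstdev


def adaptive_instance_norm(features, gamma, beta, eps=1e-5):
    """Generic style-adaptive normalization (illustrative, not HSAN).

    features: list of channels, each a list of floats over time.
    gamma, beta: per-channel scale and shift. In a real model these would
        be predicted from a style embedding (e.g. timbre or emotion);
        here they are passed in directly as an assumption.
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta need one entry per channel")

    normalized = []
    for channel, scale, shift in zip(features, gamma, beta):
        mean = fmean(channel)
        # eps keeps the division stable for near-constant channels.
        std = (pstdev(channel) ** 2 + eps) ** 0.5
        normalized.append([scale * (x - mean) / std + shift for x in channel])
    return normalized


if __name__ == "__main__":
    content_features = [[1.0, 2.0, 3.0], [10.0, 10.5, 11.0]]
    # Hypothetical style-derived parameters, one pair per channel.
    styled = adaptive_instance_norm(content_features, gamma=[2.0, 0.5], beta=[5.0, -1.0])
    print(styled)
```

After the call, each output channel has mean approximately `beta` and standard deviation approximately `gamma`, which is how a style signal can impose its statistics on content features without changing their temporal shape.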

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Face recognition and analysis