FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions

Dekun Chen; Xueyao Zhang; Yuancheng Wang; Kenan Dai; Li Ma; Zhizheng Wu

arXiv:2601.04656 · cs.SD · January 9, 2026

Open Access · 3 Reviews

TL;DR

FlexiVoice is a TTS system built on an LLM core that controls speaking style through natural language instructions and clones voice timbre zero-shot from a speech reference, trained with a progressive post-training scheme.

Contribution

It introduces a new Progressive Post-Training scheme and a multi-objective optimization approach for accurate, flexible style and timbre control in zero-shot TTS.
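As a rough illustration of what a multi-objective, group-relative update signal can look like, the sketch below combines several per-sample scores into one reward and standardizes it within a sampling group, which is the core idea of GRPO. The objective names (`style`, `timbre`, `content`), the weights, the weighted-sum scalarization, and the use of the population standard deviation are all assumptions made for illustration; this page does not state how FlexiVoice defines or combines its rewards.

```python
from statistics import mean, pstdev


def combine_rewards(scores, weights):
    """Weighted sum of per-objective scores for one sample.

    Scalarizing by weighted sum is an assumption, not the paper's stated method.
    """
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical objectives and weights, chosen only for this example.
weights = {"style": 0.5, "timbre": 0.3, "content": 0.2}

# One group: several candidate utterances sampled for the same prompt.
group = [
    {"style": 0.9, "timbre": 0.8, "content": 1.0},
    {"style": 0.4, "timbre": 0.9, "content": 1.0},
    {"style": 0.7, "timbre": 0.3, "content": 0.9},
    {"style": 0.2, "timbre": 0.2, "content": 0.5},
]

rewards = [combine_rewards(s, weights) for s in group]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.2f}")
```

Samples that beat their group's average receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.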

Findings

1. Outperforms baseline methods in style control accuracy
2. Demonstrates strong disentanglement of style, timbre, and content
3. Achieves high naturalness and robustness in human evaluations

Abstract

This study proposes FlexiVoice, a text-to-speech (TTS) synthesis system capable of flexible style control with zero-shot voice cloning. The speaking style is controlled by a natural-language instruction, and the voice timbre is provided by a speech reference in a zero-shot manner. FlexiVoice is built with an LLM core, which takes text as input together with an optional natural language instruction and an optional speech reference to control style and timbre, respectively. FlexiVoice is equipped with a novel Progressive Post-Training (PPT) scheme that progressively unlocks accurate and flexible controllability. In particular, it first employs Direct Preference Optimization (DPO) to enable FlexiVoice to accurately follow both the natural language instruction and the speech reference simultaneously. It then uses a multi-objective Group Relative Policy Optimization (GRPO) to disentangle style…
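For readers unfamiliar with the DPO stage named in the abstract, the following is a minimal sketch of the standard DPO loss for a single preference pair. It shows the generic objective only; how FlexiVoice builds its preferred and rejected samples, and which `beta` it uses, is not stated on this page, so the argument names and the default value are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and under a
    frozen reference model. beta=0.1 is an illustrative default, not a value
    taken from the paper.
    """
    # How much more the policy prefers the chosen sample than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)) written as a numerically stable softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))

# Policy has moved toward the chosen sample: the loss drops below log(2).
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the policy raises the likelihood of the preferred sample relative to the reference model, while the reference term discourages drifting far from the starting model.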

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. This work bridges instruction-based and zero-shot TTS research, demonstrating how LLM-based reinforcement learning can generalize across diverse speaking scenarios. The progressive reinforcement learning curriculum (PPT) is novel in the context of TTS controllability. The unified design for style–timbre disentanglement with both natural language and reference inputs is a meaningful advance beyond previous instruction-based or zero-shot models.
2. This work addresses a critical and challengin

Weaknesses

1. The decoupling evaluation is heavily focused on emotional control. While emotions are a key aspect of style, the paper would be strengthened by a more direct analysis of other stylistic aspects (e.g., speaking rate, pitch range, informal tone) to fully validate the "any style" claim. A qualitative analysis or case studies on non-emotional instructions would be valuable.
2. The progressive training scheme involving multiple stages of DPO and GRPO is computationally intensive. The paper does

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The proposed curriculum learning framework, DPO followed by two-stage GRPO, achieves its targets: controllability (initial alignment via DPO), then disentanglement and generalization (via GRPO). The PPT scheme is applied successfully, demonstrating that progressive alignment strategies can realize controllability while disentangling style instruction, reference timbre, and textual content.
2. The instruct dataset construction and evaluation set design as text-only

Weaknesses

1. The paper claims the model "can speak in any style with any voice"; however, the instructions used in training and in the benchmarks (TO and TR) cover only five emotions: neutral, happy, angry, sad, and surprised. The evaluation does not clearly demonstrate other emotions, other natural language style instructions, or conflicting voice timbre in the subsequent subjective test, which makes the title somewhat overclaimed.
2. For the evaluation, only the first evaluation set involves human judges in a CMOS test, and

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The paper presents an extensive and well-engineered system that integrates multiple components and datasets into a cohesive framework, which is commendable.
- It targets the problem of style-universal speech synthesis, addressing the need for flexible, controllable, and expressive TTS generation.
- The proposed prompt-based approach allows conditioning on various prosodic or emotional cues, contributing to improved speech diversity and controllability.
- The work offers thorough evaluations

Weaknesses

Problem & Motivation

- The motivation could be strengthened by explaining how this approach differs from recent prompt-based or diffusion-based TTS systems. The overall goal overlaps with several existing works such as FlexiVoice, PromptTTS, and StyleSpeech, making it difficult to isolate what is novel.
- The paper mentions disentangling timbre and style, but it is unclear how this disentanglement is achieved or guaranteed. Without a clear mathematical constraint or empirical validatio

Taxonomy

Topics: Speech Recognition and Synthesis · Music Technology and Sound Studies · Voice and Speech Disorders