UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models

Wenming Tu; Guanrou Yang; Ruiqi Yan; Wenxi Chen; Ziyang Ma; Yipeng Kang; Kai Yu; Xie Chen; Zilong Zheng

arXiv:2510.22588 · eess.AS · October 28, 2025

TL;DR

UltraVoice introduces a large-scale speech dialogue dataset enabling fine-grained control over speech styles, significantly improving speech model controllability and maintaining core conversational abilities.

Contribution

The paper presents UltraVoice, a novel dataset for multi-dimensional speech style control, and demonstrates its effectiveness in enhancing speech dialogue models and controllable TTS systems.

Findings

1. Significant improvements in MOS and IFR scores after fine-tuning on UltraVoice.
2. Enhanced core understanding and reasoning in dialogue models.
3. Dataset enables high-quality expressive speech synthesis.

Abstract

Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for multiple fine-grained speech style control. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key speech stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and…

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The paper targets an important gap by enabling fine-grained style control across six dimensions and provides the first large-scale dialogue dataset designed for this purpose.
- The dataset construction pipeline is thorough, with explicit filtering via CER and audio duration for quality enhancement.
- Fine-tuning on UltraVoice raises instruction following and subjective naturalness across models and across most style dimensions. General conversational ability also improves on URO-Bench.
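The CER-and-duration filtering mentioned in the strengths above could look roughly like the following. This is a minimal sketch, not the authors' pipeline: the function names, the `max_cer=0.2` threshold, and the `max_duration_s=30.0` limit are all hypothetical, and the ASR transcript is assumed to be supplied by some external recognizer.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def keep_sample(ref_text: str, asr_text: str, duration_s: float,
                max_cer: float = 0.2, max_duration_s: float = 30.0) -> bool:
    """Keep a synthesized clip only if its transcript and length pass.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    return cer(ref_text, asr_text) <= max_cer and duration_s <= max_duration_s
```

A clip whose ASR transcript drifts far from the intended text, or whose audio runs too long, would be dropped under these assumed thresholds.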

Weaknesses

- The corpus is fully synthetic, built with GPT-4o and several TTS or VC systems, which may introduce artifacts and reduce diversity.
- Some evaluations are done by an ALM judge, which raises concerns about reproducibility and potential bias in subjective scores. A stronger human evaluation component would help.
- Since the authors argue that existing controllable TTS data are not suitable for fine-tuning dialogue models, the paper should add comparative fine-tuning results using those datasets.

Reviewer 02 · Rating 4 · Confidence 4
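One way to address the concern about ALM-judged scores is to check how well the judge's ratings track human ratings on a shared subset of clips. The sketch below computes Spearman rank correlation in plain Python; it is an illustration of the general idea only, not a procedure described in the paper, and all function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(judge_scores, human_scores):
    """Spearman rank correlation between judge and human ratings."""
    if len(judge_scores) != len(human_scores) or len(judge_scores) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))
```

A correlation near 1 on such a subset would suggest the judge orders clips much as human raters do; it would not by itself show that the absolute scores agree.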

Strengths

1. UltraVoice covers six control dimensions (Emotion, Speed, Volume, Accent, Language, Composite) in an instruction–response dialogue format, with 100,770 dialogues (~833 hours). The authors position it as the first dataset explicitly designed for fine-grained, multi-dimensional control in end-to-end voice agents.
2. After SFT on UltraVoice, fine-tuned models show Instruction-Following Rate (IFR) gains of +14.61 to +40.09 percentage points across backbones and sizes; Mean Opinion Score (MOS) imp…

Weaknesses

1. User prompts come from varied speakers and noise conditions, but system replies use a single voice, and some Accent samples require voice conversion after TTS. This may cap speaker diversity and introduce voice conversion artifacts, creating a domain gap to real human recordings, and might affect how downstream models understand human voice features.
2. The Language dimension (Chinese/Korean/Japanese) is harder: the authors claim that LLaMA-based models show smaller gains or regressions in MOS/IFR there…

Reviewer 03 · Rating 2 · Confidence 4

Strengths

Here are two major strengths from my perspective.
1) It constructs an 830-hour spoken dialogue dataset with rich fine-grained styles and instructions for conversation. It also controls quality by filtering on recognition results. The paper gives clear step-by-step instructions for building the dataset. It identifies a limitation of current spoken dialogue data: the lack of style control.
2) It assesses the dataset with detailed statistics across different dimensions and experime…

Weaknesses

Major weaknesses are
1. Lack of innovation from a research perspective, in either algorithms or methodology. The lack of fine-grained speech style control is a real challenge, but building the dataset from an existing conversation corpus, with instructions injected by a GPT-like model and speech generated by various TTS/voice conversion models, is a little too simple and artificial; it is not real interaction data. The generation process cannot simulate real interaction as in a real scenario, like the real r…

Reviewer 04 · Rating 2 · Confidence 5

Strengths

- The study identifies the important point of poor expressiveness in speech interaction within the field of speech dialogue and constructs a large-scale, stylistically diverse speech dialogue dataset.
- The dialogue model fine-tuned with this dataset demonstrates enhanced expressiveness in responses while retaining its foundational capabilities. This indicates its potential to effectively enhance the performance of existing dialogue systems.

Weaknesses

- The subjective metrics (MOS and IFR) are obtained with Gemini-2.5-flash, but the reference mentioned in line 336 uses Gemini-2.5-pro. There is a gap between these two models in speech-related judgments, so the subjective scores are not verified to be consistent with human ratings, which diminishes the persuasiveness of the evaluation results.
- All data is generated by existing TTS models, particularly the crucial response component. This implies that the upper limit of stylistic diversity and natural…

Code & Models

Models

tutu0604/UltraVoice-SFT
model · 6 dl

Datasets

tutu0604/UltraVoice
dataset · 286 dl
