VoiceSculptor: Your Voice, Designed By You

Jingbin Hu; Huakang Chen; Linhan Ma; Dake Guo; Qirui Zhan; Wenhao Li; Haoyu Zhang; Kangxiang Xia; Ziyu Zhang; Wenjie Tian; Chengyou Wang; Jinrui Liang; Shuhan Guo; Zihang Yang; Bengu Wu; Binbin Zhang; Pengcheng Zhu; Pengyuan Xie; Chuan Xie; Qiang Zhang; Jie Liu; Lei Xie

arXiv:2601.10629·eess.AS·January 21, 2026

Open Access

TL;DR

VoiceSculptor is an open-source system that enables fine-grained, instruction-based control over speech attributes and high-fidelity voice cloning, advancing reproducible research in controllable TTS.

Contribution

VoiceSculptor introduces a unified framework that combines instruction-based voice design with high-fidelity voice cloning, supporting iterative refinement via Retrieval-Augmented Generation (RAG) and attribute-level editing.

Findings

01

Achieves state-of-the-art results among open-source systems on InstructTTSEval-Zh.

02

Supports controllable speaker timbre from natural language descriptions.

03

Fully open-sourced with code and pretrained models.

Abstract

Despite rapid progress in text-to-speech (TTS), open-source systems still lack truly instruction-following, fine-grained control over core speech attributes (e.g., pitch, speaking rate, age, emotion, and style). We present VoiceSculptor, an open-source unified system that bridges this gap by integrating instruction-based voice design and high-fidelity voice cloning in a single framework. It generates controllable speaker timbre directly from natural-language descriptions, supports iterative refinement via Retrieval-Augmented Generation (RAG), and provides attribute-level edits across multiple dimensions. The designed voice is then rendered into a prompt waveform and fed into a cloning model to enable high-fidelity timbre transfer for downstream speech synthesis. VoiceSculptor achieves open-source state-of-the-art (SOTA) on InstructTTSEval-Zh, and is fully open-sourced, including code…
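
The two-stage flow described in the abstract (design a voice from a description, render it as a prompt waveform, then hand that prompt to a cloning model) can be sketched roughly as follows. This is only an illustration of the data flow, not the authors' code or API: `PromptWaveform`, `design_voice`, `clone_and_synthesize`, and `synthesize` are hypothetical names, and both stages are placeholders that return dummy data instead of running a model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PromptWaveform:
    """Hypothetical container for the rendered prompt audio."""
    samples: List[float]
    sample_rate: int
    description: str  # the natural-language instruction it was designed from


def design_voice(description: str, sample_rate: int = 16000,
                 duration_s: float = 1.0) -> PromptWaveform:
    """Stage 1 placeholder: a real system would run the voice-design model here.

    This stub returns silence of the requested length so the sketch is runnable.
    """
    n_samples = int(sample_rate * duration_s)
    return PromptWaveform([0.0] * n_samples, sample_rate, description)


def clone_and_synthesize(prompt: PromptWaveform, text: str) -> Dict[str, object]:
    """Stage 2 placeholder: a real system would run the cloning model here.

    This stub only records which prompt the timbre would be transferred from.
    """
    return {
        "text": text,
        "timbre_source": prompt.description,
        "sample_rate": prompt.sample_rate,
    }


def synthesize(description: str, texts: List[str]) -> List[Dict[str, object]]:
    """Design the voice once, then reuse the same prompt for every utterance."""
    prompt = design_voice(description)
    return [clone_and_synthesize(prompt, t) for t in texts]


if __name__ == "__main__":
    for item in synthesize("a calm, low-pitched elderly male voice",
                           ["Hello.", "How are you today?"]):
        print(item)
```

The point of the split is that the designed voice is rendered a single time and then reused as the cloning prompt, so every downstream utterance shares the same timbre.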

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Topic Modeling