InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

Kexin Huang; Qian Tu; Liwei Fan; Chenchen Yang; Dong Zhang; Shimin Li; Zhaoye Fei; Qinyuan Cheng; Xipeng Qiu

arXiv:2506.16381 · cs.CL · June 23, 2025

TL;DR

InstructTTSEval is a benchmark, with tasks and automatic metrics, for measuring how well text-to-speech systems follow complex natural-language instructions.

Contribution

The paper presents InstructTTSEval, a new benchmark with diverse tasks and an automatic evaluation method for assessing instruction-following in TTS models.

Findings

01. Current TTS systems show significant room for improvement in instruction following.

02. The benchmark includes English and Chinese datasets with 6,000 test cases.

03. Automatic evaluation with Gemini provides consistent assessment of instruction adherence.
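
As a rough illustration of how per-sample judge verdicts could be rolled up into instruction-adherence rates, here is a minimal Python sketch. The `Verdict` record, the `adherence_rates` helper, the task labels, and the sample data are all hypothetical; this is not the paper's evaluation code, and no judge model is actually called.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Verdict:
    """One judged sample (hypothetical record layout, not the benchmark's schema)."""
    language: str  # e.g. "en" or "zh"
    task: str      # placeholder task label
    follows: bool  # the judge's yes/no decision on instruction adherence


def adherence_rates(verdicts):
    """Return the fraction of samples judged as following the instruction,
    grouped by (language, task)."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for v in verdicts:
        key = (v.language, v.task)
        totals[key] += 1
        passed[key] += int(v.follows)
    return {key: passed[key] / totals[key] for key in totals}


# Made-up verdicts, for illustration only.
sample = [
    Verdict("en", "task_a", True),
    Verdict("en", "task_a", False),
    Verdict("zh", "task_a", True),
    Verdict("zh", "task_b", True),
]
rates = adherence_rates(sample)
for (language, task), rate in sorted(rates.items()):
    print(f"{language} {task}: {rate:.2f}")
```

Grouping by language and task keeps the English and Chinese results separate, so a system that is strong in one language does not hide weakness in the other.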

Abstract

In modern speech synthesis, paralinguistic information (such as a speaker's vocal timbre, emotional state, and dynamic prosody) plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or an inserted speech prompt to control these cues, which severely limits flexibility. Recent attempts seek to employ natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many TTS systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. In addition, there is still a shortage of high-quality benchmarks and automated evaluation metrics specifically designed for instruction-based TTS, which hinders accurate assessment and iterative…

Code & Models

Repositories

kexinhuang19/instructttseval
Official

Models

ASLP-lab/VoiceSculptor-VD
model · 54 dl · ♡ 18

Datasets

CaasiHUANG/InstructTTSEval
dataset · 552 dl

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Phonetics and Phonology Research