NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

Liumeng Xue; Weizhen Bian; Jiahao Pan; Wenxuan Wang; Yilin Ren; Boyi Kang; Jingbin Hu; Ziyang Ma; Shuai Wang; Xinyuan Qian; Hung-yi Lee; Yike Guo

arXiv:2604.16211 · cs.SD · April 22, 2026

TL;DR

NVBench is a bilingual (English/Chinese) benchmark for evaluating how well speech synthesis systems generate and control non-verbal vocalizations such as laughs and sighs, an aspect of speech naturalness for which standardized evaluation remains limited.

Contribution

The paper introduces NVBench, a standardized bilingual benchmark with a multi-axis protocol for assessing the generation, control, and salience of non-verbal vocalizations (NVVs) in speech synthesis systems.

Findings

01

NVV controllability often decouples from speech quality.

02

Low-SNR oral cues and long-duration affective NVVs remain persistent challenges.

03

NVBench enables fair comparison of diverse speech synthesis systems; the paper benchmarks 15 TTS systems.

Abstract

Non-verbal vocalizations (NVVs) such as laughs, sighs, and sobs are essential for human-like speech, yet standardized evaluation remains limited in jointly assessing whether systems can generate the intended NVVs, place them correctly, and keep them salient without harming speech. We present the Non-verbal Vocalization Benchmark (NVBench), a bilingual (English/Chinese) benchmark that evaluates speech synthesis with NVVs. NVBench pairs a unified 45-type taxonomy with a curated bilingual dataset and introduces a multi-axis protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience. We benchmark 15 TTS systems using objective metrics, listening tests, and an LLM-based multi-rater evaluation. Results reveal that NVV controllability often decouples from quality, while low-SNR oral cues and long-duration affective NVVs remain persistent…
