NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation

Qinke Ni; Huan Liao; Dekun Chen; Yuxiang Wang; Zhizheng Wu

arXiv:2603.15352·cs.SD·March 19, 2026

TL;DR

NV-Bench is a benchmark for evaluating nonverbal vocalization (NV) synthesis in expressive TTS. It pairs 1,651 multilingual, in-the-wild utterances across 14 NV categories with standardized metrics for controllability and acoustic realism.

Contribution

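NV-Bench is the first benchmark for NV synthesis grounded in a functional taxonomy. It combines a dual-dimensional evaluation protocol (Instruction Alignment and Acoustic Fidelity) with a multilingual dataset that includes paired human reference audio.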

Findings

01. Strong correlation between objective metrics and human perception.

02. Benchmark enables standardized evaluation of NV synthesis models.

03. Provides baseline results for diverse TTS models.

Abstract

While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluations lack standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that treats NVs as communicative acts rather than acoustic artifacts. NV-Bench comprises 1,651 multilingual, in-the-wild utterances with paired human reference audio, balanced across 14 NV categories. We introduce a dual-dimensional evaluation protocol: (1) Instruction Alignment, which uses the proposed paralinguistic character error rate (PCER) to assess controllability, and (2) Acoustic Fidelity, which measures the distributional gap to real recordings to assess acoustic realism. We evaluate diverse TTS models and develop two baselines. Experimental results demonstrate a strong correlation between our objective metrics and…
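The abstract names PCER but does not define it here, so the following is only a minimal sketch of a character-level error rate in which each NV tag counts as a single token. The bracketed tag format (for example `[laugh]`), the function names, and the normalization by reference length are all assumptions made for illustration, not the paper's definition.

```python
import re

# Assumed tag format: NVs are written inline as bracketed labels, e.g. [laugh].
TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")


def tokenize(text):
    """Split text into characters, keeping each bracketed NV tag as one token.

    Whitespace is dropped; tags are lowercased so [Laugh] matches [laugh].
    """
    tokens = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        tokens.extend(ch for ch in text[pos:match.start()] if not ch.isspace())
        tokens.append(match.group().lower())
        pos = match.end()
    tokens.extend(ch for ch in text[pos:] if not ch.isspace())
    return tokens


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis):
    """Edit distance over tag-aware tokens, divided by reference length."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)
```

Under this tokenization, `error_rate("hi [laugh] ok", "hi ok")` is 0.2: one missing tag out of five reference tokens. In practice the hypothesis string would come from transcribing the synthesized audio with a model that also emits NV tags, which is outside the scope of this sketch.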

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Emotion and Mood Recognition