NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations

Huan Liao; Qinke Ni; Yuancheng Wang; Yiheng Lu; Haoyue Zhan; Pengyuan Xie; Qiang Zhang; Zhizheng Wu

arXiv:2508.04195 · cs.SD · August 7, 2025

TL;DR

NVSpeech introduces a scalable pipeline that recognizes and synthesizes paralinguistic vocalizations in speech, enabling more natural and expressive Mandarin speech systems through dataset creation, joint recognition, and controllable synthesis.

Contribution

NVSpeech presents the first large-scale Mandarin dataset with word-level paralinguistic annotations, along with a unified model for recognizing and controlling paralinguistic vocalizations in speech.

Findings

01. Created a manually annotated dataset of 48,430 human-spoken utterances with 18 word-level paralinguistic categories

02. Developed a paralinguistic-aware ASR model that transcribes words and paralinguistic cues jointly

03. Fine-tuned TTS models for explicit control of vocalizations

Abstract

Paralinguistic vocalizations, including non-verbal sounds like laughter and breathing as well as lexicalized interjections such as "uhm" and "oh", are integral to natural spoken communication. Despite their importance in conveying affect, intent, and interactional cues, they remain largely overlooked in conventional automatic speech recognition (ASR) and text-to-speech (TTS) systems. We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. (1) We introduce a manually annotated dataset of 48,430 human-spoken utterances with 18 word-level paralinguistic categories. (2) We develop a paralinguistic-aware ASR model, which treats paralinguistic cues as inline decodable tokens (e.g., "You're so funny [Laughter]"), enabling joint lexical and…
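To make the inline-token idea concrete, here is a minimal sketch of how a transcript such as "You're so funny [Laughter]" could be split into its lexical text and its paralinguistic events. It is an illustration only, not the authors' code: the bracketed `[Name]` tag syntax is inferred from the single example in the abstract, and `TAG_PATTERN` and `split_transcript` are hypothetical names introduced here.

```python
import re
from typing import List, Tuple

# Assumption: tags appear inline as [Name], following the one example
# given in the abstract. The paper's actual tag inventory is not listed here.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_transcript(tagged: str) -> Tuple[str, List[Tuple[int, str]]]:
    """Separate lexical words from inline paralinguistic tags.

    Returns the plain text and a list of (word_index, tag) pairs, where
    word_index is the number of lexical words that precede the tag.
    """
    words: List[str] = []
    events: List[Tuple[int, str]] = []
    pos = 0
    for match in TAG_PATTERN.finditer(tagged):
        # Collect the lexical words between the previous tag and this one.
        words.extend(tagged[pos:match.start()].split())
        events.append((len(words), match.group(1)))
        pos = match.end()
    # Collect any lexical words after the last tag.
    words.extend(tagged[pos:].split())
    return " ".join(words), events


text, events = split_transcript("You're so funny [Laughter]")
print(text)    # You're so funny
print(events)  # [(3, 'Laughter')]
```

Keeping the word index alongside each tag preserves where the vocalization occurs in the utterance, which is the information a word-level annotation carries.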
