NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech

Maksim Borisov; Egor Spirin; Daria Diatlova

arXiv:2507.13155·cs.LG·July 18, 2025

TL;DR

NonverbalTTS (NVTTS) is a 17-hour open-access dataset of text-aligned nonverbal vocalizations (NVs) with emotion labels, intended to help expressive text-to-speech (TTS) models handle a wider range of NVs.

Contribution

The paper presents a new open-access dataset covering diverse NVs and emotions, describes the annotation pipeline used to build it, and reports that open-source TTS models fine-tuned on the dataset reach parity with closed-source systems.

Findings

01 Fine-tuning open-source TTS models on NVTTS achieves parity with closed-source systems such as CosyVoice2, in both human evaluation and automatic metrics.

02 The dataset improves modeling of nonverbal vocalizations in TTS.

03 Annotations are produced by automated detection followed by human validation.

Abstract

Current expressive speech synthesis models are constrained by the limited availability of open-source datasets containing diverse nonverbal vocalizations (NVs). In this work, we introduce NonverbalTTS (NVTTS), a 17-hour open-access dataset annotated with 10 types of NVs (e.g., laughter, coughs) and 8 emotional categories. The dataset is derived from popular sources, VoxCeleb and Expresso, using automated detection followed by human validation. We propose a comprehensive pipeline that integrates automatic speech recognition (ASR), NV tagging, emotion classification, and a fusion algorithm to merge transcriptions from multiple annotators. Fine-tuning open-source text-to-speech (TTS) models on the NVTTS dataset achieves parity with closed-source systems such as CosyVoice2, as measured by both human evaluation and automatic metrics, including speaker similarity and NV fidelity. By releasing…
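The abstract mentions a fusion algorithm that merges transcriptions from multiple annotators but does not describe it here. The sketch below is one plausible stand-in, not the authors' method: a simple majority vote over NV tag placements. It assumes, hypothetically, that NV tags are written inline as bracketed tokens such as `[laughter]` and that all annotators share the same word sequence (for example, from a common ASR pass). The function names `split_tags` and `fuse` are illustrative only.

```python
from collections import Counter


def split_tags(transcript):
    """Split a transcript into plain words and (gap, tag) pairs.

    `gap` is the number of words that precede the tag, so a tag at gap 0
    sits before the first word. The bracketed tag format is an assumption.
    """
    words, pairs = [], []
    for token in transcript.split():
        if token.startswith("[") and token.endswith("]"):
            pairs.append((len(words), token))
        else:
            words.append(token)
    return words, pairs


def fuse(transcripts):
    """Keep each NV tag placement that a strict majority of annotators agree on."""
    parsed = [split_tags(t) for t in transcripts]
    words = parsed[0][0]
    if any(w != words for w, _ in parsed):
        raise ValueError("annotators must share the same word sequence")

    # Each annotator votes at most once for a given (gap, tag) placement.
    votes = Counter(pair for _, pairs in parsed for pair in set(pairs))
    quorum = len(transcripts) / 2

    tags_by_gap = {}
    for (gap, tag), count in sorted(votes.items()):
        if count > quorum:
            tags_by_gap.setdefault(gap, []).append(tag)

    # Rebuild the transcript, emitting agreed tags before the word at each gap.
    fused = []
    for i in range(len(words) + 1):
        fused.extend(tags_by_gap.get(i, []))
        if i < len(words):
            fused.append(words[i])
    return " ".join(fused)


annotations = [
    "well [laughter] that was fun",
    "well [laughter] that was fun [cough]",
    "well that [laughter] was fun",
]
print(fuse(annotations))
```

With three annotators, only the `[laughter]` after the first word is kept, because it is the sole placement marked by more than half of them. A real pipeline would also need to reconcile differing word sequences, which this sketch deliberately rejects.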

Code & Models

Models

yasserrmd/SparkNV-Voice
model · 8 downloads · 3 likes

Datasets

deepvk/NonverbalTTS
dataset · 177 downloads

Taxonomy

Topics: Digital Communication and Language · Speech and dialogue systems