A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding
Runchuan Ye, Yixuan Zhou, Renjie Yu, Zijian Lin, Kehan Li, Xiang Li, Xin Liu, Guoyang Zeng, Zhiyong Wu

TL;DR
This paper introduces a scalable, automatic annotation framework for non-verbal vocalizations (NVs) in speech and uses it to build a large dataset that improves NVs generation and understanding in speech systems.
Contribution
The paper presents a novel, low-cost, scalable method for annotating non-verbal vocalizations and releases a large, diverse dataset for research in NVs.
Findings
The dataset enables better controllability in NVs generation.
The framework achieves high accuracy in detecting NVs in natural speech.
NVs understanding performance is comparable to existing methods.
Abstract
Non-verbal Vocalizations (NVs), such as laughter and sighs, are vital for conveying emotion and intention in human speech, yet most existing speech systems neglect them, which severely compromises communicative richness and emotional intelligence. Existing methods for NVs acquisition are either costly and unscalable (relying on manual annotation/recording) or unnatural (relying on rule-based synthesis). To address these limitations, we propose a highly scalable automatic annotation framework to label non-verbal phenomena from natural speech, which is low-cost, easily extendable, and inherently diverse and natural. This framework leverages a unified detection model to accurately identify NVs in natural speech and integrates them with transcripts via a temporal-semantic alignment method. Using this framework, we created and released NonVerbalSpeech-38K, a diverse, real-world…
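To make the idea of integrating detected NVs with a transcript concrete, here is a minimal sketch. It is not the authors' method: it uses timestamps only, whereas the abstract describes a temporal-semantic alignment whose details are not given here. The names `Word`, `NVEvent`, and `insert_nv_tags`, the bracketed tag format, and the midpoint rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A transcript word with start/end times in seconds (assumed input format)."""
    text: str
    start: float
    end: float


@dataclass
class NVEvent:
    """A detected non-verbal vocalization with start/end times in seconds (hypothetical)."""
    label: str
    start: float
    end: float


def _mid(start: float, end: float) -> float:
    return (start + end) / 2.0


def insert_nv_tags(words: list[Word], events: list[NVEvent]) -> str:
    """Insert each NV tag before the first word whose midpoint is at or after
    the event's midpoint; events later than every word go at the end.

    This is a purely temporal simplification, not a semantic alignment.
    """
    pending = sorted(events, key=lambda e: _mid(e.start, e.end))
    tokens: list[str] = []
    i = 0
    for w in words:
        w_mid = _mid(w.start, w.end)
        # Emit every event that occurs no later than this word.
        while i < len(pending) and _mid(pending[i].start, pending[i].end) <= w_mid:
            tokens.append(f"[{pending[i].label}]")
            i += 1
        tokens.append(w.text)
    # Events that follow the last word are appended as trailing tags.
    tokens.extend(f"[{e.label}]" for e in pending[i:])
    return " ".join(tokens)


# Illustrative, made-up timestamps (not taken from the dataset).
example_words = [Word("that", 0.0, 0.3), Word("was", 0.3, 0.5), Word("great", 0.5, 0.9)]
example_events = [NVEvent("laughter", 1.0, 1.6), NVEvent("sigh", 0.28, 0.32)]
print(insert_nv_tags(example_words, example_events))
```

With the made-up timestamps above, the sigh lands between "that" and "was" and the laughter follows the final word. A real pipeline would also need to handle NVs that overlap speech, which this sketch ignores.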
