DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage
Kyra Wang, Dorien Herremans

TL;DR
This paper introduces DisfluencySpeech, a high-quality English speech dataset with paralanguage, enabling better training of conversational TTS models that include non-lexical sounds like laughing and sighing.
Contribution
It provides a single-speaker, studio-quality dataset with annotated paralanguage and multiple transcripts, facilitating the development of TTS systems that generate expressive, disfluent speech.
Findings
Benchmark TTS models are provided, trained on different transcript levels.
The dataset includes nearly 10 hours of expressive speech.
The recordings simulate realistic informal conversations.
Abstract
Laughing, sighing, stuttering, and other forms of paralanguage do not contribute any direct lexical meaning to speech, but they provide crucial propositional context that aids semantic and pragmatic processes such as irony. It is thus important for artificial social agents to both understand and be able to generate speech with semantically-important paralanguage. Most speech datasets do not include transcribed non-lexical speech sounds and disfluencies, while those that do are typically multi-speaker datasets where each speaker provides relatively little audio. This makes it challenging to train conversational Text-to-Speech (TTS) synthesis models that include such paralinguistic components. We thus present DisfluencySpeech, a studio-quality labeled English speech dataset with paralanguage. A single speaker recreates nearly 10 hours of expressive utterances from the Switchboard-1…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
