DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage

Kyra Wang; Dorien Herremans

arXiv:2406.08820·eess.AS·June 14, 2024

TL;DR

This paper introduces DisfluencySpeech, a high-quality English speech dataset with paralanguage, enabling better training of conversational TTS models that include non-lexical sounds like laughing and sighing.

Contribution

It provides a single-speaker, studio-quality dataset with annotated paralanguage and multiple transcripts, facilitating the development of TTS systems that generate expressive, disfluent speech.
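One way to picture the multiple transcripts is as progressively cleaned versions of a single annotated utterance. The sketch below is illustrative only: the bracketed tag format (such as `[laughter]`), the filled-pause list, and the helper names `strip_non_speech` and `strip_filled_pauses` are all assumptions made for this example, not the dataset's documented schema or the authors' tooling.

```python
import re

# ASSUMPTION: non-speech events appear as bracketed tags, e.g. "[laughter]".
# This format is hypothetical and chosen only for illustration.
NON_SPEECH_TAG = re.compile(r"\[[^\]]+\]")

# ASSUMPTION: a small, hypothetical set of filled pauses treated as disfluencies.
FILLED_PAUSES = {"uh", "um"}


def strip_non_speech(transcript: str) -> str:
    """Remove bracketed non-speech tags and collapse leftover whitespace."""
    without_tags = NON_SPEECH_TAG.sub(" ", transcript)
    return " ".join(without_tags.split())


def strip_filled_pauses(transcript: str) -> str:
    """Drop filled-pause tokens, ignoring case and surrounding commas or periods."""
    kept = [
        token
        for token in transcript.split()
        if token.strip(",.").lower() not in FILLED_PAUSES
    ]
    return " ".join(kept)


# Three hypothetical transcript levels derived from one annotated utterance.
full = "Well, uh, I guess [laughter] that works [sigh]"
no_events = strip_non_speech(full)        # disfluencies kept, events removed
fluent = strip_filled_pauses(no_events)   # events and filled pauses removed
```

Training a TTS model on the fullest level would let it learn to render the tagged events, while the cleaner levels leave the model to place or omit them on its own; which trade-off the benchmark models make is described in the paper itself.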

Findings

01 Benchmark TTS models are trained on different transcript levels.

02 The dataset includes nearly 10 hours of expressive speech.

03 The recordings simulate realistic informal conversations.

Abstract

Laughing, sighing, stuttering, and other forms of paralanguage do not contribute any direct lexical meaning to speech, but they provide crucial propositional context that aids semantic and pragmatic processes such as irony. It is thus important for artificial social agents to both understand and be able to generate speech with semantically-important paralanguage. Most speech datasets do not include transcribed non-lexical speech sounds and disfluencies, while those that do are typically multi-speaker datasets where each speaker provides relatively little audio. This makes it challenging to train conversational Text-to-Speech (TTS) synthesis models that include such paralinguistic components. We thus present DisfluencySpeech, a studio-quality labeled English speech dataset with paralanguage. A single speaker recreates nearly 10 hours of expressive utterances from the Switchboard-1…

Code & Models

Datasets

amaai-lab/DisfluencySpeech
dataset · 289 downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems