SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description

Zeyu Jin; Jia Jia; Qixin Wang; Kehan Li; Shuoyi Zhou; Songtao Zhou; Xiaoyu Qin; Zhiyong Wu

arXiv:2408.13608·cs.MM·August 28, 2024

TL;DR

SpeechCraft is a large-scale, fine-grained expressive speech dataset with natural language descriptions, created using an automatic annotation system, to enhance speech style understanding and synthesis tasks.

Contribution

The paper introduces an automatic annotation system that generates detailed natural language descriptions for speech clips, enabling the creation of a large, diverse expressive speech dataset.

Findings

01  The dataset contains approximately 2,000 hours of speech data.

02  SpeechCraft improves performance in speech style understanding tasks.

03  It significantly boosts stylistic speech synthesis capabilities.

Abstract

Speech-language multi-modal learning presents a significant challenge due to the finely nuanced information inherent in speech styles. Therefore, a large-scale dataset providing elaborate comprehension of speech style is urgently needed to facilitate insightful interplay between speech audio and natural language. However, constructing such datasets presents a major trade-off between large-scale data collection and high-quality annotation. To tackle this challenge, we propose an automatic speech annotation system for expressiveness interpretation that annotates in-the-wild speech clips with expressive and vivid human language descriptions. Initially, speech audio is processed by a series of expert classifiers and captioning models to capture diverse speech characteristics, followed by a fine-tuned LLaMA for customized annotation generation. Unlike previous tag/template-based annotation…
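The two-stage flow described in the abstract (expert classifiers first, then a language model that writes the description) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the attribute names (`emotion`, `pitch`, `speed`), the toy classifiers, the prompt wording, and the helper names `build_prompt` and `annotate` are all hypothetical, and the final call to a fine-tuned LLaMA is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in for an expert classifier: maps a clip path to a label.
Classifier = Callable[[str], str]


@dataclass
class ClipAnnotation:
    clip: str
    tags: Dict[str, str]
    prompt: str


def build_prompt(tags: Dict[str, str]) -> str:
    """Turn attribute tags into an instruction for a description-writing LLM.

    The wording of the prompt is an assumption made for illustration.
    """
    tag_text = "; ".join(f"{name}: {value}" for name, value in sorted(tags.items()))
    return (
        "Rewrite the following speech attributes as one natural-language "
        f"description of how the speaker sounds. Attributes: {tag_text}"
    )


def annotate(clip: str, classifiers: Dict[str, Classifier]) -> ClipAnnotation:
    """Stage 1: run every classifier on the clip. Stage 2: build the LLM prompt."""
    tags = {name: clf(clip) for name, clf in classifiers.items()}
    return ClipAnnotation(clip=clip, tags=tags, prompt=build_prompt(tags))


# Toy classifiers with fixed outputs, for illustration only.
toy_classifiers: Dict[str, Classifier] = {
    "emotion": lambda clip: "happy",
    "pitch": lambda clip: "high",
    "speed": lambda clip: "fast",
}

example = annotate("clip_0001.wav", toy_classifiers)
print(example.prompt)
```

In a real system, the prompt would be passed to the description-generating model and its output stored alongside the clip; keeping the raw tags in the record makes it possible to audit the generated text against the classifier outputs.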

Code & Models

Repositories

thuhcsi/speechcraft (PyTorch, official)

Taxonomy

Methods: LLaMA