TL;DR
SpeechCraft is a large-scale, fine-grained expressive speech dataset with natural language descriptions, created using an automatic annotation system, to enhance speech style understanding and synthesis tasks.
Contribution
The paper introduces an automatic annotation system that generates detailed natural language descriptions for speech clips, enabling the creation of a large, diverse expressive speech dataset.
Findings
The dataset contains approximately 2,000 hours of speech data.
SpeechCraft improves performance in speech style understanding tasks.
It significantly boosts stylistic speech synthesis capabilities.
Abstract
Speech-language multi-modal learning presents a significant challenge due to the finely nuanced information inherent in speech styles. Therefore, a large-scale dataset providing elaborate comprehension of speech style is urgently needed to facilitate insightful interplay between speech audio and natural language. However, constructing such datasets presents a major trade-off between large-scale data collection and high-quality annotation. To tackle this challenge, we propose an automatic speech annotation system for expressiveness interpretation that annotates in-the-wild speech clips with expressive and vivid human language descriptions. Initially, speech audio is processed by a series of expert classifiers and captioning models to capture diverse speech characteristics, followed by a fine-tuned LLaMA for customized annotation generation. Unlike previous tag/template-based annotation…
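The two-stage flow the abstract describes (classifier labels first, then a language model that rewrites them into a description) can be illustrated with a minimal sketch. This is not the authors' implementation: the attribute names, the `SpeechAttributes` class, the `build_annotation_prompt` function, and the prompt wording are all hypothetical, and the call to the fine-tuned LLaMA is omitted entirely.

```python
from dataclasses import dataclass, asdict


@dataclass
class SpeechAttributes:
    # Hypothetical label set; the abstract does not enumerate the
    # classifiers' actual outputs.
    gender: str
    pitch: str
    speed: str
    emotion: str
    transcript: str


def build_annotation_prompt(attrs: SpeechAttributes) -> str:
    """Assemble discrete labels into an instruction that a language model
    could rewrite into a free-form style description."""
    label_lines = [
        f"- {name}: {value}"
        for name, value in asdict(attrs).items()
        if name != "transcript"  # the transcript is quoted separately below
    ]
    return (
        "Describe the speaking style of this clip in one natural sentence.\n"
        "Labels:\n" + "\n".join(label_lines) + "\n"
        f'Transcript: "{attrs.transcript}"'
    )


# Assumed example values, standing in for real classifier output.
example = SpeechAttributes(
    gender="female",
    pitch="high",
    speed="fast",
    emotion="happy",
    transcript="We won the game!",
)
prompt = build_annotation_prompt(example)
print(prompt)
```

In a full system, the resulting prompt would be passed to the annotation model, whose output would replace the fixed tags or templates that the abstract contrasts against.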
Taxonomy
Methods: LLaMA
