SingNet: Towards a Large-Scale, Diverse, and In-the-Wild Singing Voice Dataset
Yicheng Gu, Chaoren Wang, Junan Zhang, Xueyao Zhang, Zihao Fang, Haorui He, Zhizheng Wu

TL;DR
SingNet introduces a large, diverse, and in-the-wild singing voice dataset of 3000 hours, enabling advancements in singing voice synthesis, conversion, and related applications through open-source models and benchmark evaluations.
Contribution
The paper presents a novel large-scale, diverse singing voice dataset and demonstrates its utility by training and benchmarking state-of-the-art models for various singing voice tasks.
Findings
Successful creation of a 3000-hour diverse singing voice dataset
Open-source pre-trained models for singing voice synthesis and conversion
Benchmark results showing effectiveness of the dataset in singing voice tasks
Abstract
The lack of a publicly-available large-scale and diverse dataset has long been a significant bottleneck for singing voice applications like Singing Voice Synthesis (SVS) and Singing Voice Conversion (SVC). To tackle this problem, we present SingNet, an extensive, diverse, and in-the-wild singing voice dataset. Specifically, we propose a data processing pipeline to extract ready-to-use training data from sample packs and songs on the internet, forming 3000 hours of singing voices in various languages and styles. Furthermore, to facilitate the use and demonstrate the effectiveness of SingNet, we pre-train and open-source various state-of-the-art (SOTA) models on Wav2vec2, BigVGAN, and NSF-HiFiGAN based on our collected singing voice data. We also conduct benchmark experiments on Automatic Lyric Transcription (ALT), Neural Vocoder, and Singing Voice Conversion (SVC). Audio demos are…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The authors claim to release the first and largest diverse dataset for singing voice applications. If they eventually open-source the data and processing code, it will be a tremendous contribution to the academic community.
2. The authors propose a data preprocessing pipeline designed for in-the-wild data, addressing important issues such as audio quality reconstruction, data filtering, etc.
1. The authors claim that the proposed dataset can boost the development of singing voice synthesis (SVS). However, the common definition of SVS is synthesizing singing voices from lyrics and melody annotations (such as MIDIs or F0s). The paper offers no evidence or verification that the proposed dataset supports such SVS tasks.
2. A similar concern: in lines 33-35, the authors claim that the way ACEStudio creates singing voices is manpower-consuming and inconvenient for scalin
* The proposed dataset surpasses previous ones in both quality and diversity, significantly facilitating related research.
* The authors validate the effectiveness of their dataset across multiple tasks, which strengthens the case for its use.
* The statistics of the dataset could be more fine-grained, such as the distribution of singer gender, the pitch distribution of each gender, and the sub-distributions of each language and style.
* Although this may be difficult for large datasets, the lack of fine-grained MIDI and phone duration annotations makes it relatively challenging to apply this dataset to singing voice synthesis (SVS) tasks. Some MIDI notation and alignment models may help with the annotation.
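The finer-grained statistics requested in the review above could be computed from a per-clip metadata manifest. The following is only a minimal sketch: the manifest layout, the field names (`language`, `style`, `duration_s`), and the sample values are hypothetical and are not taken from SingNet.

```python
from collections import defaultdict

# Hypothetical manifest: one record per clip. Field names and values are
# illustrative assumptions, not SingNet's actual metadata.
clips = [
    {"language": "en", "style": "pop", "duration_s": 5400.0},
    {"language": "en", "style": "rock", "duration_s": 1800.0},
    {"language": "zh", "style": "pop", "duration_s": 3600.0},
]


def hours_by(records, *keys):
    """Sum clip durations in hours, grouped by the given metadata keys."""
    totals = defaultdict(float)
    for rec in records:
        group = tuple(rec[k] for k in keys)  # e.g. ("en", "pop")
        totals[group] += rec["duration_s"] / 3600.0
    return dict(totals)


# Distribution per language, then sub-distribution per (language, style).
by_language = hours_by(clips, "language")
by_language_style = hours_by(clips, "language", "style")
```

The same helper would extend to other attributes the review mentions, such as singer gender, provided the manifest carried them.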
This paper is pioneering in providing a data processing pipeline for in-the-wild singing voice data, offering valuable insights for the singing and music community, particularly in scaling up model data. Although the dataset is still relatively small compared to speech data, it has the potential to significantly encourage the growth of singing voice data in the future. Originality: The development of an open-source data processing pipeline for extracting singing voice data from online sources i
1. Lack of Innovation: The paper closely resembles another paper's process, Emilia, without clearly articulating its improvements over Emilia. Additionally, the paper primarily leverages existing work from others, with minimal novel technical contributions. The methods related to neural vocoding and data augmentation could benefit from more comprehensive referencing to acknowledge the contributions of earlier studies.
2. The data processing pipeline is crucial but is not described with sufficie
This work conducts experiments in singing voice processing at an unprecedented scale, offering the research community valuable insights into model scalability and potential performance improvements.
The paper's primary limitations revolve around dataset accessibility and the comprehensiveness of the experimental work. While large-scale datasets are invaluable for research communities, the copyrighted nature of this dataset (88.7% from commercial sample packs, the rest from copyrighted music) prevents its release. This significantly diminishes the novelty claim of the dataset itself, as its inaccessibility hinders broader research impact. The experiments lack sufficient breadth and depth in
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
