SingNet: Towards a Large-Scale, Diverse, and In-the-Wild Singing Voice Dataset

Yicheng Gu; Chaoren Wang; Junan Zhang; Xueyao Zhang; Zihao Fang; Haorui He; Zhizheng Wu

arXiv:2505.09325 · cs.SD · May 15, 2025

Open Access · 4 Reviews

TL;DR

SingNet introduces a large, diverse, and in-the-wild singing voice dataset of 3000 hours, enabling advancements in singing voice synthesis, conversion, and related applications through open-source models and benchmark evaluations.

Contribution

The paper presents a novel large-scale, diverse singing voice dataset and demonstrates its utility by training and benchmarking state-of-the-art models for various singing voice tasks.

Findings

01 · Successful creation of a 3000-hour diverse singing voice dataset

02 · Open-source pre-trained models for singing voice synthesis and conversion

03 · Benchmark results showing effectiveness of the dataset in singing voice tasks

Abstract

The lack of a publicly-available large-scale and diverse dataset has long been a significant bottleneck for singing voice applications like Singing Voice Synthesis (SVS) and Singing Voice Conversion (SVC). To tackle this problem, we present SingNet, an extensive, diverse, and in-the-wild singing voice dataset. Specifically, we propose a data processing pipeline to extract ready-to-use training data from sample packs and songs on the internet, forming 3000 hours of singing voices in various languages and styles. Furthermore, to facilitate the use and demonstrate the effectiveness of SingNet, we pre-train and open-source various state-of-the-art (SOTA) models on Wav2vec2, BigVGAN, and NSF-HiFiGAN based on our collected singing voice data. We also conduct benchmark experiments on Automatic Lyric Transcription (ALT), Neural Vocoder, and Singing Voice Conversion (SVC). Audio demos are…

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 5

Strengths

1. The authors claim to release the first and largest diverse dataset for singing voice applications. If they eventually open-source the data and processing code, it will be a tremendous contribution to the academic community.

2. The authors propose a data preprocessing pipeline designed for in-the-wild data, addressing important issues such as audio quality reconstruction, data filtering, etc.

Weaknesses

1. The authors claim that the proposed dataset can boost the development of singing voice synthesis (SVS). However, SVS is commonly defined as synthesizing singing voices from lyrics and melody annotations (such as MIDIs or F0s). The paper offers no evidence or verification that the proposed dataset supports such SVS tasks.

2. A similar concern: in lines 33-35, the authors claim that the way ACEStudio creates singing voices is manpower-consuming and inconvenient for scalin

Reviewer 02 · Rating 6 · Confidence 3

Strengths

* The proposed dataset surpasses previous ones in both quality and diversity, significantly facilitating related research.

* The authors validate the effectiveness of their dataset across multiple tasks, making the case for its application more persuasive.

Weaknesses

* The statistics of the dataset could be more fine-grained, such as the distribution of singer gender, the pitch distribution of each gender, and the sub-distributions of each language and style.

* Although this may be difficult for large datasets, the lack of fine-grained MIDI and phone duration annotations makes it relatively challenging to apply this dataset to singing voice synthesis (SVS) tasks. Some MIDI notation and alignment models may help with the annotation.

Reviewer 03 · Rating 5 · Confidence 3

Strengths

This paper is pioneering in providing a data processing pipeline for in-the-wild singing voice data, offering valuable insights for the singing and music community, particularly in scaling up model data. Although the dataset is still relatively small compared to speech data, it has the potential to significantly encourage the growth of singing voice data in the future.

Originality: The development of an open-source data processing pipeline for extracting singing voice data from online sources i

Weaknesses

1. Lack of Innovation: The paper closely resembles another paper's process, Emilia, without clearly articulating its improvements over Emilia. Additionally, the paper primarily leverages existing work from others, with minimal novel technical contributions. The methods related to neural vocoding and data augmentation could benefit from more comprehensive referencing to acknowledge the contributions of earlier studies.

2. The data processing pipeline is crucial but is not described with sufficie

Reviewer 04 · Rating 3 · Confidence 4

Strengths

This work conducts experiments in singing voice processing at an unprecedented scale, offering the research community valuable insights into model scalability and potential performance improvements.

Weaknesses

The paper's primary limitations revolve around dataset accessibility and the comprehensiveness of the experimental work. While large-scale datasets are invaluable for research communities, the copyrighted nature of this dataset (88.7% from commercial sample packs, the rest from copyrighted music) prevents its release. This significantly diminishes the novelty claim of the dataset itself, as its inaccessibility hinders broader research impact. The experiments lack sufficient breadth and depth in

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis