Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm
Yuning Wu, Jiatong Shi, Yifeng Yu, Yuxun Tang, Tao Qian, Yueqian Lin, Jionghao Han, Xinyi Bai, Shinji Watanabe, and Qin Jin

TL;DR
Muskits-ESPnet is a singing voice synthesis toolkit built on pretrained audio models. It accepts multi-format input, automatically detects and corrects music score errors, and includes a perception auto-evaluation module.
Contribution
The paper presents Muskits-ESPnet, a comprehensive toolkit that applies pretrained audio models to singing voice synthesis (SVS), supports various data formats, and incorporates music score error detection and perceptual evaluation.
Findings
Supports multi-format inputs for SVS models
Includes automatic music score error detection and correction
Features a perception auto-evaluation module
Abstract
This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from self-supervised learning (SSL) models and audio codecs. The toolkit offers significant advantages in versatility and intelligence, supporting multi-format inputs and adaptable data processing workflows for various SVS models. It features automatic music score error detection and correction, as well as a perception auto-evaluation module that imitates human subjective evaluation scores. Muskits-ESPnet is available at https://github.com/espnet/espnet.
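To make the idea of automatic music score error detection and correction more concrete, here is a minimal sketch in Python. It is not the toolkit's implementation: the `Note` structure, the `check_and_fix` function, and the three rules (out-of-range pitch, overlap with the previous note, non-positive duration) are all hypothetical choices made for illustration, and the abstract does not specify which checks Muskits-ESPnet actually performs.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Note:
    """Hypothetical score entry; not a Muskits-ESPnet data type."""
    start: float  # onset time in seconds
    end: float    # offset time in seconds
    lyric: str    # syllable or phoneme label
    midi: int     # MIDI pitch number (0-127)


def check_and_fix(notes: List[Note]) -> Tuple[List[Note], List[str]]:
    """Return a corrected copy of `notes` and a log of detected problems.

    Indices in the log refer to positions after sorting by onset time.
    """
    fixed: List[Note] = []
    issues: List[str] = []
    for i, note in enumerate(sorted(notes, key=lambda n: n.start)):
        # Rule 1 (assumed): pitches outside the MIDI range are invalid.
        if not 0 <= note.midi <= 127:
            issues.append(f"note {i}: pitch {note.midi} out of range, dropped")
            continue
        # Rule 2 (assumed): a note may not start before the previous one ends.
        # Correct it by delaying the onset to the previous offset.
        if fixed and note.start < fixed[-1].end:
            issues.append(f"note {i}: overlaps previous note, onset delayed")
            note = replace(note, start=fixed[-1].end)
        # Rule 3 (assumed): notes must have positive duration.
        if note.end <= note.start:
            issues.append(f"note {i}: non-positive duration, dropped")
            continue
        fixed.append(note)
    return fixed, issues
```

Calling `check_and_fix` on a list of notes returns the cleaned list together with a human-readable log, so a data preparation step could either apply the corrections silently or surface the log for manual review.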
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
