Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm

Yuning Wu; Jiatong Shi; Yifeng Yu; Yuxun Tang; Tao Qian; Yueqian Lin; Jionghao Han; Xinyi Bai; Shinji Watanabe; Qin Jin

arXiv:2409.07226·cs.SD·October 14, 2024

TL;DR

Muskits-ESPnet is a versatile toolkit for singing voice synthesis (SVS) built on pretrained audio models. It supports multi-format inputs, automatic music score error detection and correction, and a perception auto-evaluation module.

Contribution

The paper presents Muskits-ESPnet, a comprehensive toolkit that applies pretrained audio models to SVS, supports multiple input data formats, and incorporates automatic music score error detection and correction along with perceptual auto-evaluation.

Findings

01

Supports multi-format inputs for SVS models

02

Includes automatic music score error detection and correction

03

Features a perception auto-evaluation module

Abstract

This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from self-supervised learning (SSL) models and audio codecs. The toolkit offers significant advantages in versatility and intelligence, supporting multi-format inputs and adaptable data processing workflows for various SVS models. It features automatic music score error detection and correction, as well as a perception auto-evaluation module that imitates human subjective evaluation scores. Muskits-ESPnet is available at https://github.com/espnet/espnet.

Code & Models

Repositories

espnet/espnet
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing