SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models
Yuxun Tang, Yuning Wu, Jiatong Shi, Qin Jin

TL;DR
SingOMD extracts singing-oriented multi-resolution discrete representations from speech SSL models. It targets two obstacles to using those models for singing voice synthesis: the domain gap between speech and singing, and singing's need for a more refined representation than typical speech.
Contribution
SingOMD adapts speech SSL features to singing through a resynthesis task, adds resampling-based multi-resolution modules, and discretizes the adapted features via clustering for use in singing voice generation.
Findings
Effective in singing vocoders and voice synthesis
Robust and efficient representation extraction
Improves quality of singing voice synthesis
Abstract
Discrete representation has shown advantages in speech generation tasks, wherein discrete tokens are derived by discretizing hidden features from self-supervised learning (SSL) pre-trained models. However, the direct application of speech SSL models to singing generation encounters domain gaps between speech and singing. Furthermore, singing generation necessitates a more refined representation than typical speech. To address these challenges, we introduce SingOMD, a novel method to extract singing-oriented multi-resolution discrete representations from speech SSL models. Specifically, we first adapt the features from speech SSL through a resynthesis task and incorporate multi-resolution modules based on resampling to better serve singing generation. These adapted multi-resolution features are then discretized via clustering. Extensive experiments demonstrate the robustness, efficiency,…
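The resampling and clustering steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the resynthesis-based adaptation is omitted, random vectors stand in for SSL hidden features, and the linear-interpolation resampler, the plain k-means routine, the resampling factors, and all function names (`resample_features`, `kmeans`, `discretize_multires`) are assumptions chosen only for illustration.

```python
import numpy as np


def resample_features(feats, factor):
    """Linearly interpolate a (T, D) sequence to round(T * factor) frames.

    Assumption: simple linear interpolation stands in for the paper's
    resampling-based multi-resolution modules.
    """
    num_frames, dim = feats.shape
    new_len = max(1, int(round(num_frames * factor)))
    old_pos = np.arange(num_frames)
    new_pos = np.linspace(0, num_frames - 1, new_len)
    columns = [np.interp(new_pos, old_pos, feats[:, d]) for d in range(dim)]
    return np.stack(columns, axis=1)


def assign(x, centroids):
    """Return the index of the nearest centroid for every row of x."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, D) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def discretize_multires(feats, factors=(0.5, 1.0, 2.0), k=4):
    """Map one feature sequence to one token sequence per resolution."""
    tokens = {}
    for factor in factors:
        resampled = resample_features(feats, factor)
        centroids = kmeans(resampled, k)  # hypothetical: one codebook per resolution
        tokens[factor] = assign(resampled, centroids)
    return tokens
```

Calling `discretize_multires` on a `(T, D)` array returns one integer token sequence per resampling factor, with lengths scaled by that factor. In practice a codebook would be fitted on features pooled over a training corpus rather than on a single utterance as done here.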
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
