CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers
Xintong Wang, Chang Zeng, Jun Chen, Chunhui Wang

TL;DR
CrossSinger is a cross-lingual, multi-singer, high-fidelity singing voice synthesis system trained only on monolingual singers. It unifies the phoneme representation across languages, conditions the model on language information, and removes singer bias from the lyrics, so that singers can be synthesized in languages they never sang in training, including code-switched lyrics.
Contribution
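The paper introduces CrossSinger, a model built on XiaoiceSing2 that combines an International Phonetic Alphabet (IPA) based input representation, conditional layer normalization driven by language information, and a gradient reversal layer (GRL) to enable cross-lingual, multi-singer, high-fidelity singing synthesis from monolingual data.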
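To make the idea of a unified IPA representation concrete, here is a minimal sketch. The mapping tables, the language codes, and the helper names (`TO_IPA`, `build_vocab`, `encode`) are illustrative assumptions, not the lexicon or code used by the authors. The point it shows is that language-specific phoneme labels collapse into one shared symbol inventory, so sounds common to several languages share a single embedding index.

```python
# Toy per-language phoneme tables (hypothetical, hand-picked for illustration).
# "ja" uses romaji-style labels, "zh" pinyin-style labels, "en" ARPAbet-style labels.
TO_IPA = {
    "ja": {"sh": "ɕ", "a": "a", "k": "k"},
    "zh": {"x": "ɕ", "a": "a", "k": "kʰ"},
    "en": {"SH": "ʃ", "AA": "ɑ", "K": "k"},
}


def build_vocab(tables):
    """Collect every IPA symbol used by any language into one shared index."""
    symbols = sorted({ipa for table in tables.values() for ipa in table.values()})
    return {symbol: index for index, symbol in enumerate(symbols)}


VOCAB = build_vocab(TO_IPA)


def encode(lang, phonemes):
    """Map language-specific phoneme labels to shared IPA vocabulary ids."""
    table = TO_IPA[lang]
    return [VOCAB[table[p]] for p in phonemes]


# Japanese "sh" and Mandarin "x" land on the same id because both map to "ɕ".
ja_ids = encode("ja", ["sh", "a"])
zh_ids = encode("zh", ["x", "a"])
```

With a shared inventory like this, a singer who was recorded in only one language still trains embeddings that other languages reuse wherever the sounds overlap.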
Findings
Successfully synthesizes high-fidelity singing voices across multiple languages.
Demonstrates effective handling of code-switch singing scenarios.
Reduces the singer bias that arises in the lyrics because every training singer is monolingual.
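The singer-bias finding relies on a gradient reversal layer. The sketch below shows only the generic mechanism, written in plain Python without an autograd framework: the layer is the identity in the forward pass and flips the sign of the gradient in the backward pass. The class name, the `lam` scaling factor, and the assumption that the layer sits between the text-side features and a singer classifier are illustrative choices; the summary does not specify where the authors attach it.

```python
class GradientReversal:
    """Identity on the forward pass; scales gradients by -lam on the backward pass."""

    def __init__(self, lam=1.0):
        # lam is a hypothetical weighting factor for the reversed gradient.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the (assumed) singer classifier.
        return list(features)

    def backward(self, grad_from_classifier):
        # The encoder receives the negated gradient, so it is pushed to make
        # singer identity harder to predict from the text-side features.
        return [-self.lam * g for g in grad_from_classifier]


grl = GradientReversal(lam=0.5)
hidden = grl.forward([0.2, -1.0])          # unchanged: [0.2, -1.0]
grad_to_encoder = grl.backward([0.5, -2.0])  # reversed and scaled: [-0.25, 1.0]
```

Because the classifier itself still trains normally on the forward output, the two parts play against each other: the classifier tries to identify the singer, while the reversed gradient discourages the upstream features from carrying that information.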
Abstract
It is challenging to build a multi-singer, high-fidelity singing voice synthesis system with cross-lingual ability when only monolingual singers are available in the training stage. In this paper, we propose CrossSinger, a cross-lingual singing voice synthesizer based on XiaoiceSing2. Specifically, we use the International Phonetic Alphabet (IPA) to unify the representation for all languages in the training data. Moreover, we leverage conditional layer normalization to incorporate language information into the model for better pronunciation when singers meet unseen languages. Additionally, a gradient reversal layer (GRL) is used to remove the singer biases contained in the lyrics: since all singers are monolingual, a singer's identity is implicitly associated with the text. The experiment is conducted on a combination of three singing voice datasets containing Japanese Kiritan…
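Conditional layer normalization, as named in the abstract, can be sketched as follows. This is a generic formulation under stated assumptions, not the authors' implementation: the hidden vector is normalized as in ordinary layer normalization, but the scale and bias are predicted from a language embedding by linear projections instead of being fixed learned vectors. The function name, the projection matrices `w_scale` and `w_bias`, and the embedding values are hypothetical.

```python
import math


def conditional_layer_norm(x, lang_emb, w_scale, w_bias, eps=1e-5):
    """Layer-normalize x, then apply a scale and bias projected from lang_emb.

    x:        hidden vector of length d
    lang_emb: language embedding of length e
    w_scale:  d rows of length e (hypothetical projection for the scale)
    w_bias:   d rows of length e (hypothetical projection for the bias)
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]

    # Language-dependent affine parameters: one scale and one bias per channel.
    scale = [sum(w * e for w, e in zip(row, lang_emb)) for row in w_scale]
    bias = [sum(w * e for w, e in zip(row, lang_emb)) for row in w_bias]

    return [s * n + b for s, n, b in zip(scale, normed, bias)]


x = [1.0, 2.0, 3.0]
w_scale = [[1.0, 2.0]] * 3   # toy weights: scale is 1 for language A, 2 for language B
w_bias = [[0.0, 0.0]] * 3    # toy weights: no bias for either language
out_a = conditional_layer_norm(x, [1.0, 0.0], w_scale, w_bias)
out_b = conditional_layer_norm(x, [0.0, 1.0], w_scale, w_bias)
```

The same hidden vector yields different outputs for the two language embeddings, which is how language identity can steer pronunciation without a separate network per language.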
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
