MuSE-SVS: Multi-Singer Emotional Singing Voice Synthesizer that Controls Emotional Intensity
Sungjae Kim, Yewon Kim, Jewoo Jun, and Injung Kim

TL;DR
MuSE-SVS is a multi-singer emotional singing voice synthesizer that controls emotional intensity through a joint embedding in a unified embedding space. It targets higher fidelity and expressiveness in synthesized singing, along with tighter synchronization with instrumental parts.
Contribution
MuSE-SVS introduces a joint embedding approach for multi-attribute style control, a statistical pitch predictor, a context-aware residual duration predictor, and an ASPP-Transformer architecture.
Findings
Improved fidelity and expressiveness over baseline models
Effective control of emotional intensity through embedding interpolation and extrapolation
Accurate synchronization with instrumental parts
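As a rough illustration of what embedding interpolation and extrapolation can look like, the sketch below linearly blends two style embeddings. It assumes that intensity is controlled by moving along the line between a lower-intensity and a higher-intensity embedding; the function name `blend_embeddings`, the blend weight `alpha`, and the toy 3-dimensional vectors are all hypothetical and are not taken from the paper.

```python
from typing import List


def blend_embeddings(low: List[float], high: List[float], alpha: float) -> List[float]:
    """Linearly blend two embeddings (hypothetical helper, not the paper's code).

    alpha = 0 returns `low`, alpha = 1 returns `high`.
    0 < alpha < 1 interpolates between them; alpha > 1 extrapolates past `high`.
    """
    if len(low) != len(high):
        raise ValueError("embeddings must have the same dimension")
    return [lo + alpha * (hi - lo) for lo, hi in zip(low, high)]


# Toy embeddings with made-up values, purely for illustration.
low_intensity = [0.0, 1.0, 2.0]
high_intensity = [2.0, 1.0, 0.0]

halfway = blend_embeddings(low_intensity, high_intensity, 0.5)   # interpolation
stronger = blend_embeddings(low_intensity, high_intensity, 1.5)  # extrapolation
```

With `alpha` between 0 and 1 the result lies between the two trained intensity levels; with `alpha` above 1 it moves beyond the stronger one, which is the sense in which an unseen intensity level could be requested.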
Abstract
We propose a multi-singer emotional singing voice synthesizer, MuSE-SVS, that expresses emotion at various intensity levels by controlling subtle changes in pitch, energy, and phoneme duration while accurately following the score. To control multiple style attributes while avoiding loss of fidelity and expressiveness due to interference between attributes, MuSE-SVS represents all attributes and their relations together by a joint embedding in a unified embedding space. MuSE-SVS can express emotional intensity levels not included in the training data through embedding interpolation and extrapolation. We also propose a statistical pitch predictor to express pitch variance according to emotional intensity, and a context-aware residual duration predictor to prevent the accumulation of variances in phoneme duration, which is crucial for synchronization with instrumental parts. In addition,…
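To make the synchronization concern concrete, the sketch below shows one simple way to keep phoneme-duration errors from accumulating: rescale the predicted durations inside each note so that they sum exactly to the note length given by the score. This is a generic per-note renormalization and is not the paper's context-aware residual duration predictor; the function name `fit_to_note` and the frame counts are hypothetical.

```python
from typing import List


def fit_to_note(raw_durations: List[float], note_frames: int) -> List[int]:
    """Rescale phoneme durations so they sum to exactly `note_frames`.

    Hypothetical helper: uses largest-remainder rounding so that rounding
    error stays inside the note instead of drifting into later notes.
    """
    total = sum(raw_durations)
    if total <= 0:
        raise ValueError("durations must sum to a positive value")
    scaled = [d * note_frames / total for d in raw_durations]
    frames = [int(s) for s in scaled]  # floor each scaled duration
    leftover = note_frames - sum(frames)
    # Give the remaining frames to the phonemes with the largest fractional parts.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i] - frames[i], reverse=True)
    for i in order[:leftover]:
        frames[i] += 1
    return frames


# Predicted durations overshoot the 10-frame note by 2.3 frames.
aligned = fit_to_note([3.2, 9.1], note_frames=10)
```

Because every note is forced back to its scored length, an overshoot in one note cannot push all later phonemes out of time with the accompaniment.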
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
