MuSE-SVS: Multi-Singer Emotional Singing Voice Synthesizer that Controls Emotional Intensity

Sungjae Kim, Yewon Kim, Jewoo Jun, and Injung Kim

arXiv:2203.00931 · eess.AS · August 12, 2025 · 1 citation

Open Access

TL;DR

Muse-SVS is a novel multi-singer emotional singing voice synthesizer that controls emotional intensity through a unified embedding space, improving fidelity, expressiveness, and synchronization in synthesized singing.

Contribution

The paper introduces a joint embedding approach for multi-attribute control, a statistical pitch predictor, a context-aware residual duration predictor, and an ASPP-Transformer architecture for singing voice synthesis.

Findings

01 Improved fidelity and expressiveness over baseline models

02 Effective control of emotional intensity through embedding interpolation and extrapolation

03 Accurate synchronization with instrumental parts

Abstract

We propose a multi-singer emotional singing voice synthesizer, Muse-SVS, that expresses emotion at various intensity levels by controlling subtle changes in pitch, energy, and phoneme duration while accurately following the score. To control multiple style attributes while avoiding loss of fidelity and expressiveness due to interference between attributes, Muse-SVS represents all attributes and their relations together by a joint embedding in a unified embedding space. Muse-SVS can express emotional intensity levels not included in the training data through embedding interpolation and extrapolation. We also propose a statistical pitch predictor to express pitch variance according to emotional intensity, and a context-aware residual duration predictor to prevent the accumulation of variances in phoneme duration, which is crucial for synchronization with instrumental parts. In addition,…
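
The abstract mentions reaching unseen emotional intensity levels through embedding interpolation and extrapolation. The following is a minimal sketch of that general idea only, assuming a simple linear blend between a neutral embedding and an emotional one. The function name `blend_embeddings`, the scalar `alpha`, and the toy 3-dimensional vectors are all hypothetical and are not taken from the paper, whose actual formulation may differ.

```python
from typing import List


def blend_embeddings(neutral: List[float], emotional: List[float], alpha: float) -> List[float]:
    """Linearly blend two embeddings (illustrative assumption, not the paper's method).

    alpha = 0 returns the neutral embedding, alpha = 1 the emotional one.
    0 < alpha < 1 interpolates; alpha > 1 extrapolates past the emotional embedding.
    """
    if len(neutral) != len(emotional):
        raise ValueError("embeddings must have the same dimensionality")
    return [n + alpha * (e - n) for n, e in zip(neutral, emotional)]


# Toy vectors for illustration; real embeddings would be learned by the model.
neutral = [0.0, 1.0, 2.0]
happy = [2.0, 1.0, 0.0]

half = blend_embeddings(neutral, happy, 0.5)    # interpolation: weaker intensity
strong = blend_embeddings(neutral, happy, 1.5)  # extrapolation: stronger intensity
print(half, strong)
```

In this sketch, a value of `alpha` between 0 and 1 stands in for an intensity weaker than the one seen in training, and a value above 1 for a stronger one.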

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing