Mitigating Latent Mismatch in cVAE-Based Singing Voice Synthesis via Flow Matching
Minhyeok Yun, Yong-Hoon Choi

TL;DR
This paper introduces FM-Singer, a flow-matching framework that refines latent representations in cVAE-based singing voice synthesis. It reduces the mismatch between the posterior latents the decoder sees during training and the condition-only latents used at inference, with the aim of preserving expressive detail.
Contribution
The paper proposes a latent refinement method that uses flow matching and ODE integration to address latent mismatch in cVAE-based SVS without redesigning the decoder.
Findings
Improves objective metrics in singing voice synthesis
Enhances perceptual quality of generated singing voices
Maintains synthesis efficiency with latent refinement
Abstract
Singing voice synthesis (SVS) aims to generate natural and expressive singing waveforms from symbolic musical scores. In cVAE-based SVS, however, a mismatch arises because the decoder is trained with latent representations inferred from target singing signals, while inference relies on latent representations predicted only from conditioning inputs. This discrepancy can weaken fine expressive acoustic details in the synthesized output. To mitigate this issue, we propose FM-Singer, a flow-matching-based latent refinement framework for cVAE-based singing voice synthesis. Rather than redesigning the acoustic decoder, the proposed method learns a continuous vector field that transports inference-time latent samples toward posterior-like latent representations through ODE-based integration before waveform generation. Because the refinement is performed in latent space, the method remains…
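The latent transport described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear interpolation path, the fixed-step Euler solver, and the names `fm_training_pair`, `euler_refine`, and `toy_velocity` are all assumptions made for illustration, and the toy vector field stands in for a learned network that would also take the conditioning inputs.

```python
import numpy as np

def fm_training_pair(z_prior, z_post, t):
    """Build one flow-matching training example.

    Assumes a straight-line path between a prior latent (t = 0) and a
    posterior latent (t = 1); the source does not state which path is used.
    """
    z_t = (1.0 - t) * z_prior + t * z_post   # point on the path at time t
    target_velocity = z_post - z_prior        # constant velocity along that path
    return z_t, target_velocity

def euler_refine(z_prior, velocity_fn, num_steps=8):
    """Integrate dz/dt = velocity_fn(z, t) from t = 0 to t = 1 with Euler steps."""
    z = np.array(z_prior, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z = z + dt * velocity_fn(z, t)
    return z

def toy_velocity(z, t, target=np.array([1.0, -2.0])):
    """Hypothetical stand-in for the learned field: points straight at a fixed target."""
    return (target - z) / (1.0 - t)

if __name__ == "__main__":
    z_prior = np.zeros(2)
    z_refined = euler_refine(z_prior, toy_velocity, num_steps=8)
    print(z_refined)  # refined latent that would be passed to the decoder
```

In a real system the velocity function would be a neural network regressed onto `target_velocity` at sampled times `t`, and the refined latent would be handed to the unchanged cVAE decoder. The number of Euler steps trades refinement accuracy against inference cost.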
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Voice and Speech Disorders
