Mitigating Latent Mismatch in cVAE-Based Singing Voice Synthesis via Flow Matching

Minhyeok Yun; Yong-Hoon Choi

arXiv:2601.00217·cs.SD·March 16, 2026

Open Access

TL;DR

This paper introduces FM-Singer, a flow-matching framework that refines latent representations in cVAE-based singing voice synthesis, reducing mismatch between training and inference to improve expressive quality.

Contribution

The paper proposes a latent refinement method that uses flow matching and ODE integration to address latent mismatch in cVAE-based SVS without redesigning the decoder.

Findings

01 Improves objective metrics in singing voice synthesis

02 Enhances perceptual quality of generated singing voices

03 Maintains synthesis efficiency with latent refinement

Abstract

Singing voice synthesis (SVS) aims to generate natural and expressive singing waveforms from symbolic musical scores. In cVAE-based SVS, however, a mismatch arises because the decoder is trained with latent representations inferred from target singing signals, while inference relies on latent representations predicted only from conditioning inputs. This discrepancy can weaken fine expressive acoustic details in the synthesized output. To mitigate this issue, we propose FM-Singer, a flow-matching-based latent refinement framework for cVAE-based singing voice synthesis. Rather than redesigning the acoustic decoder, the proposed method learns a continuous vector field that transports inference-time latent samples toward posterior-like latent representations through ODE-based integration before waveform generation. Because the refinement is performed in latent space, the method remains…
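
The latent refinement idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`flow_matching_pair`, `flow_matching_loss`, `euler_refine`), the linear interpolation path, the fixed-step Euler solver, and the latent shape are all assumptions chosen for illustration, and an analytic toy vector field stands in for the learned network.

```python
import numpy as np


def flow_matching_pair(z_prior, z_post, t):
    """Point on a straight path from z_prior to z_post, plus its velocity.

    Assumption: a linear path, a common choice in flow matching. The
    source does not state which path FM-Singer uses.
    """
    z_t = (1.0 - t) * z_prior + t * z_post
    target_velocity = z_post - z_prior
    return z_t, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


def euler_refine(z_prior, vector_field, num_steps=8):
    """Integrate dz/dt = vector_field(z, t) from t=0 to t=1 with Euler steps."""
    z = np.array(z_prior, dtype=np.float64)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z = z + dt * vector_field(z, t)
    return z


# Toy data: hypothetical "prior" and "posterior" latents, shape (batch, dim).
rng = np.random.default_rng(0)
z_prior = rng.normal(size=(4, 16))
z_post = rng.normal(size=(4, 16))


def toy_field(z, t):
    # Analytic stand-in for a learned network: points at the known target.
    # A real model would only see z, t, and the conditioning inputs.
    return (z_post - z) / (1.0 - t)


z_refined = euler_refine(z_prior, toy_field, num_steps=8)
```

In this toy setting the refined latents land on the target latents because the field is constructed from them. In an actual system the field would be a trained network conditioned on the score inputs, and the refined latent would then be passed to the unchanged decoder.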

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Voice and Speech Disorders