DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching

Wei Chen; Binzhu Sha; Dan Luo; Jing Yang; Zhuo Wang; Fan Fan; Zhiyong Wu

arXiv:2508.05978 · cs.SD · August 11, 2025

TL;DR

DAFMSVC is a one-shot singing voice conversion method that uses a dual cross-attention mechanism and flow matching to improve timbre similarity and audio quality, handling unseen target singers without quality degradation.

Contribution

The paper introduces DAFMSVC, combining a dual attention mechanism and flow matching with SSL features to address timbre leakage and quality issues in one-shot singing voice conversion.
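To make the flow matching component concrete, here is a minimal sketch of the general technique in plain Python. It assumes a straight-line probability path between a noise sample `x0` and a data sample `x1`, which is one common formulation; this summary does not state which path or conditioning DAFMSVC uses, and the function names `interpolate`, `target_velocity`, and `euler_sample` are illustrative, not taken from the paper.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (noise) to x1 (data) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity network under the straight-line path (assumed)."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check: an oracle that returns the true velocity carries x0 to x1.
x0 = [0.0, 1.0]
x1 = [2.0, -1.0]
oracle = lambda x, t: target_velocity(x0, x1)
generated = euler_sample(oracle, x0, steps=4)
```

In training, a neural network would be fit to predict `target_velocity(x0, x1)` from `interpolate(x0, x1, t)`, `t`, and the conditioning features; at inference, that network replaces the oracle in `euler_sample`.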

Findings

01. Significantly improves timbre similarity and naturalness.

02. Outperforms state-of-the-art methods in evaluations.

03. Effectively prevents timbre leakage for unseen speakers.

Abstract

Singing Voice Conversion (SVC) converts a source singer's timbre to that of a target singer while preserving melody and lyrics. The key challenge in any-to-any SVC is adapting unseen speaker timbres to source audio without quality degradation. Existing methods either suffer from timbre leakage or fail to achieve satisfactory timbre similarity and quality in the generated audio. To address these challenges, we propose DAFMSVC, in which the self-supervised learning (SSL) features of the source audio are replaced with the most similar SSL features from the target audio to prevent timbre leakage. DAFMSVC also incorporates a dual cross-attention mechanism for the adaptive fusion of speaker embeddings, melody, and linguistic content. Additionally, we introduce a flow matching module for high-quality audio generation from the fused features. Experimental results show that DAFMSVC significantly enhances timbre similarity…
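The feature replacement step described in the abstract can be illustrated with a small sketch. It assumes cosine similarity as the similarity measure and a single best match per frame; the abstract only says "most similar", so both choices are assumptions, as are the names `cosine` and `replace_with_nearest`. Frames are toy 2-dimensional vectors rather than real SSL features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def replace_with_nearest(source_frames, target_frames):
    """Swap each source frame for the most similar target frame.

    Assumption: top-1 cosine match. The output contains only vectors drawn
    from the target pool, which is the intuition behind avoiding timbre leakage.
    """
    if not target_frames:
        raise ValueError("target_frames must not be empty")
    matched = []
    for s in source_frames:
        best = max(target_frames, key=lambda t: cosine(s, t))
        matched.append(list(best))
    return matched


# Toy example: two source frames, three target frames.
source = [[1.0, 0.1], [0.0, 2.0]]
target = [[0.0, 1.0], [3.0, 0.0], [1.0, 1.0]]
converted = replace_with_nearest(source, target)
```

Because every output frame is copied from the target audio, the converted sequence keeps the source's frame order (and hence its content timing) while carrying no source feature vectors forward.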

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing