S$^2$Voice: Style-Aware Autoregressive Modeling with Enhanced Conditioning for Singing Style Conversion
Ziqian Wang, Xianjun Xia, Chuanzeng Huang, Lei Xie

TL;DR
S$^2$Voice is the winning system of both singing style conversion tracks (in-domain and zero-shot) at SVCC 2025. It improves style control, timbre similarity, and robustness over the Vevo baseline through style-aware conditioning, large-scale data curation, and multi-stage training.
Contribution
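The paper introduces style-aware conditioning for the AR LLM (FiLM-style layer-norm conditioning and style-aware cross-attention), a global speaker embedding in the flow-matching transformer, an automatically curated large, high-quality singing corpus, and a multi-stage training strategy that combines SFT and DPO.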
Findings
Outperforms baseline in style and singer similarity
Achieves higher naturalness and style fidelity in subjective tests
Ablation studies indicate that each proposed component contributes to performance
Abstract
We present S$^2$Voice, the winning system of the Singing Voice Conversion Challenge (SVCC) 2025 for both the in-domain and zero-shot singing style conversion tracks. Built on the strong two-stage Vevo baseline, S$^2$Voice advances style control and robustness through several contributions. First, we integrate style embeddings into the autoregressive large language model (AR LLM) via FiLM-style layer-norm conditioning and style-aware cross-attention for enhanced fine-grained style modeling. Second, we introduce a global speaker embedding into the flow-matching transformer to improve timbre similarity. Third, we curate a large, high-quality singing corpus via an automated pipeline for web harvesting, vocal separation, and transcript refinement. Finally, we employ a multi-stage training strategy combining supervised fine-tuning (SFT) and direct preference optimization (DPO). Subjective…
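To make the FiLM-style layer-norm conditioning mentioned in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`layer_norm`, `linear`, `film_layer_norm`), the `(1 + gamma)` scale parameterization, and the single-vector shapes are all assumptions chosen for illustration. The general idea is that a style embedding is projected to a per-feature scale and shift, which then modulate the normalized hidden state.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a 1-D feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weight, bias, vec):
    """Compute W @ vec + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def film_layer_norm(hidden, style, w_gamma, b_gamma, w_beta, b_beta):
    """Layer norm whose scale and shift are predicted from a style embedding.

    Assumption (not from the paper): the scale is parameterized as
    (1 + gamma), so zero-initialized projections reduce to plain layer norm.
    """
    gamma = linear(w_gamma, b_gamma, style)  # per-feature scale offset
    beta = linear(w_beta, b_beta, style)     # per-feature shift
    normed = layer_norm(hidden)
    return [(1.0 + g) * n + b for g, n, b in zip(gamma, normed, beta)]

# Hypothetical toy inputs: a 4-dim hidden state and a 2-dim style embedding.
hidden = [0.5, -1.0, 2.0, 0.0]
style = [0.3, -0.7]
zero_w = [[0.0, 0.0] for _ in hidden]
zero_b = [0.0] * len(hidden)

# With zero projections the style has no effect yet.
identity_out = film_layer_norm(hidden, style, zero_w, zero_b, zero_w, zero_b)

# A constant shift bias moves every normalized feature by 1.0.
shifted_out = film_layer_norm(hidden, style, zero_w, zero_b,
                              zero_w, [1.0] * len(hidden))
```

In this sketch, zero-initialized projections leave the layer equivalent to an unconditioned layer norm, which is one common way to add conditioning to a pretrained model without disturbing its initial behavior. Whether S$^2$Voice uses this initialization is not stated in the abstract.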
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies
