S$^2$Voice: Style-Aware Autoregressive Modeling with Enhanced Conditioning for Singing Style Conversion

Ziqian Wang; Xianjun Xia; Chuanzeng Huang; Lei Xie

arXiv:2601.13629 · eess.AS · January 21, 2026

TL;DR

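S$^2$Voice is a singing style conversion system, and the winning entry in both tracks of SVCC 2025, that builds on the two-stage Vevo baseline. It improves style control, timbre similarity, and robustness through style-aware conditioning of the autoregressive model, a global speaker embedding in the flow-matching transformer, large-scale data curation, and multi-stage training that combines SFT and DPO.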

Contribution

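The paper introduces style-aware conditioning for the AR LLM (FiLM-style layer-norm conditioning and style-aware cross-attention), a global speaker embedding in the flow-matching transformer, an automated pipeline for curating a large, high-quality singing corpus, and a training strategy that combines supervised fine-tuning with direct preference optimization.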

Findings

01 Outperforms the baseline in both style similarity and singer similarity

02 Achieves higher naturalness and style fidelity in subjective tests

03 Ablation studies indicate that the proposed components improve performance

Abstract

We present S$^2$Voice, the winning system of the Singing Voice Conversion Challenge (SVCC) 2025 for both the in-domain and zero-shot singing style conversion tracks. Built on the strong two-stage Vevo baseline, S$^2$Voice advances style control and robustness through several contributions. First, we integrate style embeddings into the autoregressive large language model (AR LLM) via FiLM-style layer-norm conditioning and style-aware cross-attention for enhanced fine-grained style modeling. Second, we introduce a global speaker embedding into the flow-matching transformer to improve timbre similarity. Third, we curate a large, high-quality singing corpus via an automated pipeline for web harvesting, vocal separation, and transcript refinement. Finally, we employ a multi-stage training strategy combining supervised fine-tuning (SFT) and direct preference optimization (DPO). Subjective…
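
To make the first contribution concrete, the sketch below illustrates the general idea of FiLM-style layer-norm conditioning: a hidden vector is normalized, then scaled and shifted by values predicted from a style embedding. It is a minimal pure-Python illustration, not the authors' implementation. The function names, the `(1 + gamma)` scaling form, and the single linear projection per modulation term are all assumptions made for this example; the abstract does not specify these details.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, bias, s):
    """Apply a linear map: one output per row of `weight`."""
    return [sum(w * si for w, si in zip(row, s)) + b
            for row, b in zip(weight, bias)]


def film_layer_norm(x, style, w_gamma, b_gamma, w_beta, b_beta):
    """Layer norm whose scale and shift are predicted from a style embedding.

    Assumed form (hypothetical, chosen for illustration):
        y = (1 + gamma(style)) * LN(x) + beta(style)
    With all projection parameters at zero this reduces to plain layer norm.
    """
    normed = layer_norm(x)
    gamma = linear(w_gamma, b_gamma, style)  # per-feature scale offset
    beta = linear(w_beta, b_beta, style)     # per-feature shift
    return [(1.0 + g) * n + b for g, n, b in zip(gamma, normed, beta)]


if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0, 4.0]   # one token's hidden state (toy size)
    style = [0.5, -1.0]             # toy style embedding
    w_gamma = [[1.0, 0.0]] * 4      # gamma depends on the first style dim
    w_beta = [[0.0, 1.0]] * 4       # beta depends on the second style dim
    zeros = [0.0] * 4
    print(film_layer_norm(hidden, style, w_gamma, zeros, w_beta, zeros))
```

Under these assumptions, zero-initialized projections leave the normalized activations unchanged, so a conditioning path of this kind can be added to a pretrained model without perturbing its initial behavior.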

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies