Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck
Zhetao Hu, Yiquan Zhou, Wenyu Wang, Zhiyu Wu, Xin Gao, Jihua Zhu

TL;DR
This paper presents a singing style conversion system that targets fine-grained style control, suppression of residual source style, and high-fidelity output, using a boundary-aware bottleneck and high-frequency band completion. It achieved the best naturalness in the SVCC2025 evaluations.
Contribution
The system combines a boundary-aware Whisper bottleneck, an explicit frame-level technique matrix, and high-frequency band completion to improve style conversion quality and control when training data is limited.
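As a rough illustration of what an explicit frame-level style (technique) matrix could look like, the sketch below builds a binary matrix with one row per frame and one column per technique from annotated frame spans. The label set, the `technique_matrix` helper, and the span format are all hypothetical; the summary does not specify how the authors construct or encode their matrix.

```python
from typing import List, Sequence, Tuple

# Hypothetical label set; the actual techniques used by the system are not listed here.
TECHNIQUES = ("breathy", "falsetto", "vibrato")


def technique_matrix(
    num_frames: int,
    segments: Sequence[Tuple[str, int, int]],
    techniques: Sequence[str] = TECHNIQUES,
) -> List[List[int]]:
    """Return a num_frames x len(techniques) binary matrix.

    Each segment is (technique_name, start_frame, end_frame) with an
    exclusive end. Overlapping techniques may be active on the same frame.
    """
    index = {name: k for k, name in enumerate(techniques)}
    matrix = [[0] * len(techniques) for _ in range(num_frames)]
    for name, start, end in segments:
        k = index[name]  # raises KeyError for an unknown technique name
        # Clip the span to the valid frame range before marking it active.
        for t in range(max(0, start), min(num_frames, end)):
            matrix[t][k] = 1
    return matrix


# Six frames: "breathy" on frames 0-2, "vibrato" on frames 2-4.
m = technique_matrix(6, [("vibrato", 2, 5), ("breathy", 0, 3)])
for row in m:
    print(row)
```

Keeping the control signal as an explicit per-frame matrix means a technique can be switched on or off for any span at inference time simply by editing rows, which is one plausible reading of "fine-grained control" in this context.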
Findings
Achieved the best naturalness in SVCC2025 evaluations.
Maintained competitive speaker similarity and style control.
Performed well with significantly less data than other systems.
Abstract
This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025): a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leakage, dynamic rendering, and high-fidelity generation with limited data, we introduce three key innovations: a boundary-aware Whisper bottleneck that pools phoneme-span representations to suppress residual source style while preserving linguistic content; an explicit frame-level technique matrix, enhanced by targeted F0 processing during inference, for stable and distinct dynamic style rendering; and a perceptually motivated high-frequency band completion strategy that leverages an auxiliary standard 48 kHz SVC model to augment the high-frequency spectrum, thereby overcoming data scarcity without overfitting. In…
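The idea of pooling phoneme-span representations can be sketched as follows: every frame inside a phoneme span is replaced by the span's average feature vector, so within-phoneme variation (where source style may linger) is flattened while the phoneme identity is kept. This is a minimal sketch under stated assumptions: mean pooling, broadcasting the pooled vector back to each frame, and the `pool_phoneme_spans` helper are illustrative choices, not details confirmed by the abstract, and the toy two-dimensional vectors stand in for Whisper encoder features.

```python
from typing import List, Sequence, Tuple


def pool_phoneme_spans(
    frames: Sequence[Sequence[float]],
    boundaries: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Replace each frame inside a phoneme span with that span's mean vector.

    `boundaries` holds (start, end) frame indices with an exclusive end.
    Frames not covered by any span are copied through unchanged.
    """
    pooled = [list(f) for f in frames]
    for start, end in boundaries:
        if not 0 <= start < end <= len(frames):
            raise ValueError(f"invalid span ({start}, {end})")
        dim = len(frames[start])
        length = end - start
        # Average each feature dimension over the frames of this span.
        mean = [
            sum(frames[t][d] for t in range(start, end)) / length
            for d in range(dim)
        ]
        for t in range(start, end):
            pooled[t] = list(mean)
    return pooled


# Five toy frames split into two phoneme spans: frames 0-1 and frames 2-4.
features = [[1.0, 0.0], [3.0, 2.0], [10.0, 10.0], [20.0, 30.0], [30.0, 20.0]]
spans = [(0, 2), (2, 5)]
pooled = pool_phoneme_spans(features, spans)
print(pooled)
```

In practice the span boundaries would come from some phoneme alignment step, which this sketch takes as given.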
