TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching
Wenxiang Guo, Yu Zhang, Changhao Pan, Rongjie Huang, Li Tang, Ruiqi Li, Zhiqing Hong, Yongqi Wang, Zhou Zhao

TL;DR
TechSinger is a novel singing voice synthesis system that provides precise, multi-technique control across five languages using flow-matching models and natural language prompts, significantly improving expressiveness and realism.
Contribution
It introduces a flow-matching-based generative model for controllable singing synthesis with multi-language and multi-technique support, along with automatic technique annotation and natural language-based control.
Findings
Outperforms existing methods in audio quality and technique control
Supports five languages and seven vocal techniques
Enhances expressiveness and realism of synthetic singing voices
Abstract
Singing voice synthesis has made remarkable progress in generating natural and high-quality voices. However, existing methods rarely provide precise control over vocal techniques such as intensity, mixed voice, falsetto, bubble, and breathy tones, thus limiting the expressive potential of synthetic voices. We introduce TechSinger, an advanced system for controllable singing voice synthesis that supports five languages and seven vocal techniques. TechSinger leverages a flow-matching-based generative model to produce singing voices with enhanced expressive control over various techniques. To enhance the diversity of training data, we develop a technique detection model that automatically annotates datasets with phoneme-level technique labels. Additionally, our prompt-based technique prediction model enables users to specify desired vocal attributes through natural language, offering…
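The abstract names flow matching as the generative backbone but does not spell out the objective. As a rough illustration only, the sketch below shows the generic straight-path (conditional) flow-matching recipe: draw noise, interpolate toward a data sample, and regress a velocity. It is not TechSinger's implementation. The function names `cfm_training_pair`, `cfm_loss`, and `euler_sample` are hypothetical, the data is a plain list of floats standing in for acoustic features, and any conditioning on lyrics, notes, or technique labels is omitted.

```python
import random


def cfm_training_pair(x1, rng):
    """Build one flow-matching training example for a data sample x1.

    Assumption: a straight-line path from Gaussian noise x0 to data x1,
    which is a common choice but not confirmed as the paper's.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # noise endpoint
    t = rng.random()                                       # time in [0, 1)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # point on the path
    target = [b - a for a, b in zip(x0, x1)]               # velocity of the path
    return t, x_t, target


def cfm_loss(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, a network would receive `x_t`, `t`, and whatever conditioning the system uses, and `cfm_loss` would compare its output with `target`. At inference, `euler_sample` would be called with the trained network as `velocity_fn`, starting from fresh noise.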
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
