STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation

Wenxiang Guo; Yu Zhang; Changhao Pan; Zhiyuan Zhu; Ruiqi Li; Zhetao Chen; Wenhao Xu; Fei Wu; Zhou Zhao

arXiv:2507.06670·cs.SD·July 10, 2025



Open Access

TL;DR

STARS is a comprehensive framework that automates singing transcription, alignment, and style annotation, significantly improving dataset creation and controllable singing voice synthesis.

Contribution

STARS introduces what the authors describe as the first unified system for multi-level singing annotation, combining transcription, alignment, and style characterization in a hierarchical, non-autoregressive architecture.

Findings

01 Outperforms existing annotation methods in accuracy and robustness.

02 Enhances singing voice synthesis with better style control and naturalness.

03 Enables scalable creation of high-quality singing datasets.

Abstract

Recent breakthroughs in singing voice synthesis (SVS) have heightened the demand for high-quality annotated datasets, yet manual annotation remains prohibitively labor-intensive and resource-intensive. Existing automatic singing annotation (ASA) methods, however, primarily tackle isolated aspects of the annotation pipeline. To address this fundamental challenge, we present STARS, which is, to our knowledge, the first unified framework that simultaneously addresses singing transcription, alignment, and refined style annotation. Our framework delivers comprehensive multi-level annotations encompassing: (1) precise phoneme-audio alignment, (2) robust note transcription and temporal localization, (3) expressive vocal technique identification, and (4) global stylistic characterization including emotion and pace. The proposed architecture employs hierarchical acoustic feature processing…
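To make the four annotation levels listed in the abstract concrete, here is a minimal sketch of how such a multi-level record could be represented. It is an illustration only, not the authors' data format: every class name, field name, and label value below is hypothetical, and attaching technique labels to phoneme spans is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhonemeSpan:
    """Level 1 (hypothetical): a phoneme aligned to an audio interval."""
    phoneme: str
    start: float  # seconds
    end: float    # seconds
    # Level 3 (assumed placement): vocal techniques tagged on this span.
    techniques: List[str] = field(default_factory=list)


@dataclass
class NoteSpan:
    """Level 2 (hypothetical): a transcribed note with its time boundaries."""
    midi_pitch: int
    start: float  # seconds
    end: float    # seconds


@dataclass
class SingingAnnotation:
    """All levels for one clip; emotion and pace are the global style labels."""
    phonemes: List[PhonemeSpan]
    notes: List[NoteSpan]
    emotion: str  # Level 4: global style
    pace: str     # Level 4: global style

    def notes_for_phoneme(self, index: int) -> List[NoteSpan]:
        """Return the notes whose interval overlaps the given phoneme."""
        p = self.phonemes[index]
        return [n for n in self.notes if n.start < p.end and n.end > p.start]


# Made-up values, purely for illustration.
example = SingingAnnotation(
    phonemes=[
        PhonemeSpan("n", 0.00, 0.12),
        PhonemeSpan("i", 0.12, 0.80, techniques=["vibrato"]),
    ],
    notes=[NoteSpan(60, 0.00, 0.45), NoteSpan(62, 0.45, 0.80)],
    emotion="happy",
    pace="moderate",
)
```

Keeping phonemes and notes as separate time-stamped lists, rather than nesting one inside the other, allows a single phoneme to span several notes (as the second phoneme does here) without duplicating entries.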


Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders