TL;DR
VocalParse is a unified Large Audio Language Model for singing voice transcription that jointly models lyrics, melody, and word-note alignment, with the aim of improving accuracy and generalization to out-of-distribution singing data.
Contribution
The paper proposes a novel interleaved prompting and Chain-of-Thought strategy within a Large Audio Language Model to enhance singing voice transcription performance.
Findings
Achieves state-of-the-art results on multiple singing datasets.
Effectively models lyrics, melody, and note correspondence jointly.
Demonstrates improved generalization to out-of-distribution singing data.
Abstract
High-quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly necessary. Despite their utility, current automatic transcription systems face significant challenges: they often rely on complex multi-stage pipelines, struggle to recover text-note alignments, and exhibit poor generalization to out-of-distribution (OOD) singing data. To alleviate these issues, we present VocalParse, a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Specifically, our novel contribution is to introduce an interleaved prompting formulation that jointly models lyrics, melody, and word-note correspondence, yielding a generated sequence that directly…
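To make the idea of an interleaved sequence concrete, here is a minimal Python sketch of how lyrics and notes might be serialized so that word-note correspondence can be read directly from token order. The token markers (`<word>`, `<note>`), the MIDI pitch and seconds-based duration fields, and all class and function names are assumptions made for illustration; the abstract does not specify VocalParse's actual output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Note:
    pitch: int       # MIDI pitch number (assumed representation)
    duration: float  # length in seconds (assumed unit)


@dataclass
class AlignedWord:
    text: str
    notes: List[Note]


def to_interleaved(words: List[AlignedWord]) -> str:
    """Serialize each word immediately followed by the notes it is sung on."""
    parts = []
    for word in words:
        parts.append(f"<word>{word.text}")
        for note in word.notes:
            parts.append(f"<note>{note.pitch}:{note.duration:.2f}")
    return " ".join(parts)


def from_interleaved(sequence: str) -> List[AlignedWord]:
    """Recover word-note alignment from token order alone."""
    words: List[AlignedWord] = []
    for token in sequence.split():
        if token.startswith("<word>"):
            words.append(AlignedWord(token[len("<word>"):], []))
        elif token.startswith("<note>"):
            if not words:
                raise ValueError("note token appeared before any word token")
            pitch, duration = token[len("<note>"):].split(":")
            # A note belongs to the most recently emitted word.
            words[-1].notes.append(Note(int(pitch), float(duration)))
        else:
            raise ValueError(f"unrecognized token: {token!r}")
    return words


# Hypothetical example: "shine" sung across two notes, "on" on one note.
example = [
    AlignedWord("shine", [Note(60, 0.5), Note(62, 0.25)]),
    AlignedWord("on", [Note(64, 1.0)]),
]
sequence = to_interleaved(example)
recovered = from_interleaved(sequence)
```

In this toy format the alignment needs no separate timing pass: each note attaches to the word that precedes it, which is the property a single generated sequence would need in order to replace a multi-stage pipeline.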
