VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

Yukun Chen; Tianrui Wang; Zhaoxi Mu; Xinyu Yang; EngSiong Chng

arXiv:2605.04613·cs.SD·May 7, 2026

TL;DR

VocalParse is a unified singing voice transcription model built on a Large Audio Language Model. It jointly models lyrics, melody, and word-note alignment to improve transcription accuracy and generalization to out-of-distribution singing data.

Contribution

The paper proposes an interleaved prompting formulation, combined with a Chain-of-Thought strategy, within a Large Audio Language Model, so that lyrics, melody, and word-note correspondence are transcribed jointly rather than by a multi-stage pipeline.

Findings

01. Achieves state-of-the-art results on multiple singing datasets.

02. Effectively models lyrics, melody, and note correspondence jointly.

03. Demonstrates improved generalization to out-of-distribution singing data.

Abstract

High-quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly necessary. Despite their utility, current automatic transcription systems face significant challenges: they often rely on complex multi-stage pipelines, struggle to recover text-note alignments, and exhibit poor generalization to out-of-distribution (OOD) singing data. To alleviate these issues, we present VocalParse, a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Specifically, our novel contribution is to introduce an interleaved prompting formulation that jointly models lyrics, melody, and word-note correspondence, yielding a generated sequence that directly…
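
To make the idea of an interleaved sequence concrete, here is a minimal sketch of one way lyrics and notes could be serialized so that word-note correspondence is readable directly from token order. It is not the paper's actual format: the `<w>` and `<n>` markers, the `Note` type, and the `interleave` and `parse` helpers are all hypothetical names chosen for illustration, and pitch is assumed to be a MIDI note number with duration in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    pitch: int       # assumed: MIDI note number
    duration: float  # assumed: seconds


def interleave(words, notes_per_word):
    """Serialize each word followed immediately by the notes it is sung on."""
    if len(words) != len(notes_per_word):
        raise ValueError("each word needs its own list of notes")
    tokens = []
    for word, notes in zip(words, notes_per_word):
        tokens.append(f"<w>{word}")  # hypothetical word marker
        for note in notes:
            # hypothetical note marker: pitch and duration separated by ':'
            tokens.append(f"<n>{note.pitch}:{note.duration:.2f}")
    return tokens


def parse(tokens):
    """Recover (word, [Note, ...]) pairs from the interleaved sequence."""
    aligned = []
    for token in tokens:
        if token.startswith("<w>"):
            aligned.append((token[3:], []))
        elif token.startswith("<n>"):
            if not aligned:
                raise ValueError("note token appears before any word token")
            pitch, duration = token[3:].split(":")
            aligned[-1][1].append(Note(int(pitch), float(duration)))
        else:
            raise ValueError(f"unknown token: {token!r}")
    return aligned


words = ["shine", "on"]
notes = [[Note(60, 0.5), Note(62, 0.25)], [Note(64, 1.0)]]
tokens = interleave(words, notes)
alignment = parse(tokens)
```

In this toy layout the alignment needs no separate timing pass: every note token belongs to the most recent word token, which is the property an interleaved formulation is meant to provide.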

Code & Models

Repositories

pymaster17/VocalParse (GitHub)

Models

pymaster/VocalParse (🤗 model, 52 downloads)
