Singing Timbre Popularity Assessment Based on Multimodal Large Foundation Model

Zihao Wang; Ruibin Yuan; Ziqi Geng; Hengjia Li; Xingwei Qu; Xinyi Li; Songye Chen; Haoying Fu; Roger B. Dannenberg; Kejun Zhang

arXiv:2512.06999·cs.SD·December 9, 2025

TL;DR

This paper introduces a comprehensive, reference-free singing assessment framework using a new dataset, a hybrid model architecture, and a perceptual ranking benchmark to improve evaluation accuracy and creativity in singing performance analysis.

Contribution

It presents the Sing-MD dataset, the VocalVerse architecture, and the H-TPR benchmark, advancing automated singing assessment beyond traditional score-based methods.

Findings

01. Expert annotations are highly inconsistent, which calls traditional accuracy-based metrics into question.

02. VocalVerse models global performance features effectively within a limited memory budget.

03. The H-TPR benchmark promotes perceptually valid ranking evaluation.
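
The first and third findings both turn on how consistently performances are ordered. As a minimal sketch of one way such consistency could be quantified, the snippet below computes the fraction of clip pairs that two annotators rank in the same order. This is an illustrative pairwise-concordance measure, not the paper's H-TPR metric; the function name `pairwise_agreement` and the example scores are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(scores_a, scores_b):
    """Fraction of clip pairs that two annotators order the same way.

    Pairs that either annotator scores as a tie are skipped.
    Returns NaN when no comparable pair exists.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    concordant = 0
    comparable = 0
    for i, j in combinations(range(len(scores_a)), 2):
        diff_a = scores_a[i] - scores_a[j]
        diff_b = scores_b[i] - scores_b[j]
        if diff_a == 0 or diff_b == 0:
            continue  # a tie gives no ordering to compare
        comparable += 1
        if (diff_a > 0) == (diff_b > 0):
            concordant += 1
    return concordant / comparable if comparable else float("nan")


# Hypothetical scores from two experts for the same five clips.
expert_1 = [4, 2, 5, 3, 1]
expert_2 = [3, 2, 5, 1, 4]
print(pairwise_agreement(expert_1, expert_2))
```

A value near 1.0 means the two annotators almost always agree on which of two clips is better, while a value near 0.5 means their orderings are close to unrelated. Low agreement of this kind is what makes exact-score accuracy a questionable target.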

Abstract

Automated singing assessment is crucial for education and entertainment. However, existing systems face two fundamental limitations: reliance on reference tracks, which stifles creative expression, and the simplification of complex performances into non-diagnostic scores based solely on pitch and rhythm. We advocate for a shift from discriminative to descriptive evaluation, creating a complete ecosystem for reference-free, multi-dimensional assessment. First, we introduce Sing-MD, a large-scale dataset annotated by experts across four dimensions: breath control, timbre quality, emotional expression, and vocal technique. Our analysis reveals significant annotation inconsistencies among experts, challenging the validity of traditional accuracy-based metrics. Second, addressing the memory limitations of Multimodal Large Language Models (MLLMs) in analyzing full-length songs, we propose…

Code & Models

Models

karl-wang/QwenFeat-Vocal-Score (Hugging Face model)

Taxonomy

Topics: Music and Audio Processing · Emotion and Mood Recognition · Diverse Music Education Insights