From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano
Huan Zhang, Jinhua Liang, Simon Dixon

TL;DR
This paper benchmarks various audio encoding models on performance-level tasks in solo piano music, introducing a new dataset and demonstrating the effectiveness of domain-specific fine-tuning for nuanced musical understanding.
Contribution
It introduces the Pianism-Labelling Dataset (PLD) and evaluates pre-trained audio encoders on expertise ranking, difficulty estimation, and technique detection tasks.
Findings
Audio-MAE achieved the highest overall performance.
Best accuracy was 93.6% in expertise ranking.
Case study on Chopin data revealed challenges in top-tier performance assessment.
Abstract
Our study investigates an approach for understanding musical performances through the lens of audio encoding models, focusing on the domain of solo Western classical piano music. Compared to composition-level attribute understanding such as key or genre, we identify a knowledge gap in performance-level music understanding, and address three critical tasks: expertise ranking, difficulty estimation, and piano technique detection, introducing a comprehensive Pianism-Labelling Dataset (PLD) for this purpose. We leverage pre-trained audio encoders, specifically Jukebox, Audio-MAE, MERT, and DAC, demonstrating varied capabilities in tackling downstream tasks, to explore whether domain-specific fine-tuning enhances capability in capturing performance nuances. Our best approach achieved 93.6% accuracy in expertise ranking, 33.7% in difficulty estimation, and 46.7% in technique detection,…
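To make the expertise-ranking accuracy figure concrete, here is a minimal sketch of how such a score could be computed. It assumes the task is framed as a pairwise comparison between two performances and that each clip is scored by a linear head on top of a frozen encoder embedding; neither assumption is confirmed by this summary. The function names, the 4-dimensional embeddings, and the weights are hypothetical placeholders, not the authors' code or data.

```python
from typing import Sequence


def linear_probe_score(embedding: Sequence[float],
                       weights: Sequence[float],
                       bias: float = 0.0) -> float:
    """Score one clip embedding with a linear head (hypothetical probe)."""
    if len(embedding) != len(weights):
        raise ValueError("embedding and weights must have the same length")
    return sum(x * w for x, w in zip(embedding, weights)) + bias


def pairwise_ranking_accuracy(scores_a: Sequence[float],
                              scores_b: Sequence[float],
                              a_is_better: Sequence[bool]) -> float:
    """Fraction of pairs whose predicted ordering matches the label."""
    if not (len(scores_a) == len(scores_b) == len(a_is_better)):
        raise ValueError("all inputs must have the same length")
    if len(a_is_better) == 0:
        raise ValueError("at least one pair is required")
    correct = sum((sa > sb) == label
                  for sa, sb, label in zip(scores_a, scores_b, a_is_better))
    return correct / len(a_is_better)


# Made-up embeddings standing in for frozen-encoder outputs (assumption).
emb_a = [[1.0, 0.0, 0.5, 0.0], [0.2, 0.1, 0.0, 0.0], [0.9, 0.3, 0.1, 0.0]]
emb_b = [[0.1, 0.0, 0.2, 0.0], [0.8, 0.4, 0.0, 0.0], [0.5, 0.9, 0.9, 0.0]]
weights = [1.0, 0.5, 0.2, 0.0]          # hypothetical probe weights
labels = [True, False, True]            # True means performance A is more expert

scores_a = [linear_probe_score(e, weights) for e in emb_a]
scores_b = [linear_probe_score(e, weights) for e in emb_b]
acc = pairwise_ranking_accuracy(scores_a, scores_b, labels)
print(f"pairwise ranking accuracy: {acc:.3f}")
```

In this toy setup the probe orders two of the three pairs correctly. A real evaluation would replace the hand-written embeddings with outputs from one of the encoders named above and learn the weights on training pairs.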
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing
Methods: Dense Connections · Dilated Convolution · Convolution · VQ-VAE · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization · Jukebox · Dynamic Algorithm Configuration
