Armor: A Benchmark for Meta-evaluation of Artificial Music
Songhe Wang, Zheng Bao, Jingtong E

TL;DR
This paper introduces Armor, a comprehensive benchmark dataset designed to evaluate the effectiveness of objective evaluation methods in artificial music, aiming to bridge the gap with subjective human judgment.
Contribution
Armor is the first rigorous, cross-domain benchmark dataset for meta-evaluating objective music evaluation methods against human judgment.
Findings
There is a significant gap between objective and subjective evaluations.
Armor provides a standardized framework for future research.
Objective methods still lag behind human judgment in music quality assessment.
Abstract
Objective evaluation (OE) is essential to artificial music, but it is often very hard to determine the quality of OEs. Hitherto, subjective evaluation (SE) remains reliable and prevailing but suffers from inevitable disadvantages that OEs may overcome. Therefore, a meta-evaluation system is necessary for designers to test the effectiveness of OEs. In this paper, we present Armor, a complex and cross-domain benchmark dataset that serves this purpose. Since OEs should correlate with human judgment, we provide music as test cases for OEs and human judgment scores as touchstones. We also provide two meta-evaluation scenarios and their corresponding testing methods to assess the effectiveness of OEs. To the best of our knowledge, Armor is the first comprehensive and rigorous framework that future works could follow, take as an example, and improve upon for the task of evaluating…
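The abstract's premise that OEs should correlate with human judgment can be illustrated with a rank correlation between an OE's scores and human scores on the same pieces. The sketch below is only an illustration of that general idea, not the testing methods Armor defines (this summary does not describe them). The function names and all score values are hypothetical and made up for the example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(oe_scores, human_scores):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(oe_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(oe_scores), average_ranks(human_scores))


# Hypothetical data: mean human ratings for five pieces, plus two invented OEs.
human = [4.5, 3.0, 2.0, 4.0, 1.5]
oe_agrees = [0.9, 0.6, 0.3, 0.8, 0.1]    # orders the pieces exactly as humans do
oe_reversed = [0.1, 0.5, 0.7, 0.3, 0.9]  # orders the pieces in the opposite way

print(spearman(oe_agrees, human))
print(spearman(oe_reversed, human))
```

Under this kind of check, an OE whose ranking tracks human ratings scores near 1, and one that contradicts them scores near -1. Rank correlation is chosen here because it ignores the differing scales of OE outputs and human ratings; whether Armor uses this statistic is an assumption.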
