Benchmarking Music Generation Models and Metrics via Human Preference Studies
Florian Grötschla, Ahmet Solak, Luca A. Lanzendörfer, Roger Wattenhofer

TL;DR
This paper benchmarks state-of-the-art music generation models by comparing human preferences with various metrics through large-scale listening tests, providing insights into model quality and metric effectiveness.
Contribution
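It introduces a human preference dataset for music generation models (6k generated songs from 12 models, compared in 15k pairwise listening tests by 2.5k participants) and, to the authors' knowledge, is the first work to rank current models and metrics by human judgment.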
Findings
Human preferences correlate variably with existing metrics.
The dataset enables better evaluation of music generation quality.
Open access promotes further research in subjective metric assessment.
Abstract
Recent advancements have brought generated music closer to human-created compositions, yet evaluating these models remains challenging. While human preference is the gold standard for assessing quality, translating these subjective judgments into objective metrics, particularly for text-audio alignment and music quality, has proven difficult. In this work, we generate 6k songs using 12 state-of-the-art models and conduct a survey of 15k pairwise audio comparisons with 2.5k human participants to evaluate the correlation between human preferences and widely used metrics. To the best of our knowledge, this work is the first to rank current state-of-the-art music generation models and metrics based on human preference. To further the field of subjective metric evaluation, we provide open access to our dataset of generated music and human evaluations.
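As a rough illustration of how pairwise preferences can be compared against an automatic metric, the sketch below turns (winner, loser) votes into per-model win rates and computes a Spearman rank correlation with per-model metric scores. This is an assumption-laden toy, not the authors' procedure: the win-rate aggregation, the function names, the model names, and all numbers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def win_rates(comparisons):
    """Fraction of comparisons each model won.

    comparisons: iterable of (winner, loser) model-name pairs.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}


def ranks(scores):
    """Rank keys from 1 (lowest score) upward, averaging ranks over ties."""
    ordered = sorted(scores, key=scores.get)
    result = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of keys tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[ordered[k]] = average_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman rank correlation of two score dicts over their shared keys.

    Raises ZeroDivisionError if either side has no rank variation.
    """
    keys = sorted(set(a) & set(b))
    ra = ranks({k: a[k] for k in keys})
    rb = ranks({k: b[k] for k in keys})
    n = len(keys)
    mean_a = sum(ra.values()) / n
    mean_b = sum(rb.values()) / n
    cov = sum((ra[k] - mean_a) * (rb[k] - mean_b) for k in keys)
    var_a = sum((ra[k] - mean_a) ** 2 for k in keys)
    var_b = sum((rb[k] - mean_b) ** 2 for k in keys)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical toy data: three models and four listener votes.
votes = [("model_a", "model_b"), ("model_a", "model_c"),
         ("model_b", "model_c"), ("model_a", "model_b")]
human = win_rates(votes)

# Hypothetical metric scores where higher is assumed to mean better.
metric = {"model_a": 0.9, "model_b": 0.5, "model_c": 0.1}
agreement = spearman(human, metric)
```

A value of `agreement` near 1 would mean the metric orders the models the same way listeners do, and a value near -1 would mean it orders them in reverse. Plain win rates ignore which opponents a model faced, so a pairwise model such as Bradley-Terry may be preferable when comparisons are unevenly distributed.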
