SongEval: A Benchmark Dataset for Song Aesthetics Evaluation
Jixun Yao, Guobin Ma, Huixin Xue, Huakang Chen, Chunbo Hao, Yuepeng Jiang, Haohe Liu, Ruibin Yuan, Jin Xu, Wei Xue, Hao Liu, Lei Xie

TL;DR
SongEval introduces a comprehensive, annotated dataset of 2,399 full-length songs with human aesthetic ratings across multiple musical dimensions, enabling improved evaluation of song quality beyond existing metrics.
Contribution
This paper presents SongEval, the first large-scale benchmark dataset for song aesthetics with multi-dimensional human annotations, facilitating better evaluation of generated music.
Findings
Models trained on SongEval outperform existing objective metrics in predicting human aesthetic ratings.
The dataset covers diverse genres and languages, enhancing its applicability.
Experimental results show improved correlation with human judgments.
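As a rough illustration of how correlation with human judgments can be measured, the sketch below computes a correlation at two granularities: utterance level (one point per song) and system level (one point per generation model, after averaging its songs). The reviews note that the paper evaluates at both levels, but the Pearson coefficient, the record layout, the function names, and the toy scores here are all assumptions made for illustration, not the authors' protocol or data.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def utterance_level(records):
    """Correlate human and predicted scores song by song."""
    human = [h for _, h, _ in records]
    predicted = [p for _, _, p in records]
    return pearson(human, predicted)


def system_level(records):
    """Average scores per system first, then correlate the per-system means."""
    by_system = defaultdict(list)
    for system, human, predicted in records:
        by_system[system].append((human, predicted))
    human_means = [sum(h for h, _ in v) / len(v) for v in by_system.values()]
    pred_means = [sum(p for _, p in v) / len(v) for v in by_system.values()]
    return pearson(human_means, pred_means)


# Hypothetical toy data: (system name, human rating, predicted rating).
records = [
    ("model_a", 2.0, 2.5),
    ("model_a", 3.0, 2.0),
    ("model_b", 3.5, 3.0),
    ("model_b", 4.0, 4.5),
    ("model_c", 4.5, 4.0),
    ("model_c", 5.0, 5.0),
]

if __name__ == "__main__":
    print("utterance-level r:", round(utterance_level(records), 3))
    print("system-level r:", round(system_level(records), 3))
```

System-level scores average out per-song noise, so they tend to be higher than utterance-level scores; reporting both shows whether a predictor ranks individual songs well or only ranks generation systems well.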
Abstract
Aesthetics serve as an implicit and important criterion in song generation tasks that reflect human perception beyond objective metrics. However, evaluating the aesthetics of generated songs remains a fundamental challenge, as the appreciation of music is highly subjective. Existing evaluation metrics, such as embedding-based distances, are limited in reflecting the subjective and perceptual aspects that define musical appeal. To address this issue, we introduce SongEval, the first open-source, large-scale benchmark dataset for evaluating the aesthetics of full-length songs. SongEval includes over 2,399 songs in full length, summing up to more than 140 hours, with aesthetic ratings from 16 professional annotators with musical backgrounds. Each song is evaluated across five key dimensions: overall coherence, memorability, naturalness of vocal breathing and phrasing, clarity of song…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Comprehensive Dataset: SongEval provides a large-scale benchmark dataset with over 2,399 full-length songs, allowing for extensive analysis and evaluation of song aesthetics across various genres and languages.
- Multi-Dimensional Evaluation: The dataset includes aesthetic ratings across five key dimensions (overall coherence, memorability, naturalness, clarity, and musicality), offering a nuanced approach to assessing musical appeal beyond simple metrics.
- Improved Predictive Performance: Experi
- Subjectivity of Aesthetic Evaluation: The appreciation of music is inherently subjective, and even with professional annotators, individual biases may influence the aesthetic ratings, potentially affecting the dataset's reliability.
- Limited Scope of Genres: While SongEval covers nine mainstream genres, it may not encompass all musical styles, which could limit its applicability for evaluating songs from less represented genres or niche markets.
- Dependence on Annotator Expertise: The quality of
- Song aesthetics is a relevant problem to study, and the newly introduced benchmark dataset may be useful to the community.
- Experiments show that models trained to predict aesthetic scores outperform existing objective evaluation metrics in predicting human-perceived musical quality.
- The writing is easy to follow.
- The songs are produced by song generation models rather than being actual songs, so their quality depends largely on the generation model. It may be desirable to build the benchmark and its annotations on actual songs.
- My main concern is the experimental design used to show the effectiveness of the new dataset. The paper first compares predictions from evaluation models trained on the dataset with the human annotations. Even though the reported results lo
With 2,399 full-length songs (140+ hours), SongEval is significantly larger than existing alternatives. It covers two languages (English, Chinese) and nine genres, and includes both vocals and accompaniment—addressing the limitation of single-component focus in prior datasets. The authors test four models with distinct architectures (convolutional, self-supervised, ensemble-based) and evaluate performance at both utterance and system levels. Direct comparisons to established objective metrics (e
The dataset is dominated by generated songs from five models, with only a small subset of non-copyrighted real songs. This narrow focus on generated content limits its ability to evaluate models on real-world, human-composed music—a critical use case for aesthetic evaluation.
- The paper introduces SongEval, a large-scale, open-source benchmark dataset that addresses an important and challenging problem: evaluating the aesthetic quality of full-length AI-generated songs.
- The dataset is both substantial and diverse, comprising 2,399 songs (over 140 hours of audio) in English and Chinese across nine musical genres.
- The annotations are provided by 16 professional musicians, offering high-quality expert evaluations that strengthen the reliability and credibility of the
- The methodological basis for the chosen subjective evaluation dimensions is not well justified. The paper does not demonstrate evidence of a systematic literature review or structured expert interviews to support the selection of the five aesthetic criteria.
- The overall musicality dimension is broad and conceptually ambiguous. Even with the provided definition, its interpretive scope risks introducing significant variance among annotators, especially since it aggregates multiple perceptual qua
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Artificial Intelligence in Games
