A Survey on Evaluation Metrics for Music Generation
Faria Binte Kader, Santu Karmaker

TL;DR
This survey reviews current evaluation metrics for music generation, highlighting their limitations and proposing future research directions to develop a comprehensive framework that better aligns with human perception.
Contribution
It provides a detailed taxonomy of evaluation metrics for audio and symbolic music, critically analyzes their limitations, and suggests future research directions.
Findings
Current metrics often poorly correlate with human perception
Major limitations include cultural bias and lack of standardization
Proposes future directions for comprehensive evaluation frameworks
A Survey on Evaluation Metrics for Music Generation
Faria Binte Kader, Santu Karmaker
University of Central Florida
Abstract
Despite significant advancements in music generation systems, the methodologies for evaluating generated music have not progressed as expected due to the complex nature of music, with aspects such as structure, coherence, creativity, and emotional expressiveness. In this paper, we shed light on this research gap, introducing a detailed taxonomy of evaluation metrics for both audio and symbolic music representations. We include a critical review identifying major limitations in current evaluation methodologies, which include poor correlation between objective metrics and human perception, cross-cultural bias, and a lack of standardization that hinders cross-model comparisons. Addressing these gaps, we further propose future research directions towards building a comprehensive evaluation framework for music generation.
1 Introduction
Recent advancements in computational music research have significantly improved the ability of machines to understand and generate music (Yuan et al., 2024; Copet et al., 2024; Schneider et al., 2024). Large language models Chang et al. (2024) and diffusion-based models Yang et al. (2023) can now compose and edit melodies, and even generate complex musical pieces that mimic human creativity (Yu et al., 2023; Zhang et al., 2023b). One such example is Suno.ai (https://suno.com/), a web-based service that, given a simple prompt with lyrics, can generate a full song with a singing voice within seconds. While generative models continue to improve, large-scale evaluation of music generation still lacks standardized assessment criteria due to the inherently subjective and multidimensional nature of musical quality.
A wide range of evaluation metrics has been proposed to assess the quality of both generated audio and symbolic music scores, from statistical comparisons Chen et al. (2024) to machine learning-based similarity measures Suzuki et al. (2023) between generated and reference music. Some metrics also focus on specific musical features, such as melody Yu et al. (2022), rhythm Sheng et al. (2021), harmony Harte et al. (2006), and emotional expression Imasato et al. (2023). In addition, human evaluation is used to rate subjective qualities like overall quality and prompt alignment, which remain essential for judging expressiveness and creativity.
Unfortunately, as we discuss in detail later in the paper, these metrics rarely capture the complexities of human musical perception. The challenge lies in balancing quantitative measures with subjective listening studies Yang and Lerch (2020), as musical quality is often tied to aesthetic preference, cultural background, and contextual interpretation Huron (2001). While benchmarks such as MARBLE Yuan et al. (2023) and MusicTheoryBench Yuan et al. (2024) offer standard evaluation methods for music understanding and retrieval tasks, no comprehensive framework exists for evaluating generated music scores. To highlight the gravity of this significant gap in the current literature, we provide, in this paper, a comprehensive overview of the evaluation metrics currently used in music generation tasks. We examine computational evaluation techniques, highlighting current limitations, and propose a direction for future improvements. By analyzing existing evaluation strategies, this work aims to shed light on ongoing efforts to develop more robust, interpretable, and standardized music evaluation frameworks.
2 Background on Computational Music
2.1 Music Representation
Existing music representation techniques handle two types of music data, audio and symbolic scores, and make them computer-interpretable.
**Audio Representations** like Log Mel Spectrograms Logan et al. (2000), MFCCs Davis and Mermelstein (1980), and Chroma Features Takuya (1999) transform raw audio waveforms into machine-usable formats for generation and analysis tasks. Pre-trained text-audio encoders like CLAP Elizalde et al. (2023) and MuLan Huang et al. (2022) jointly represent audio and text in the same embedding space.
Symbolic Representations like MIDI Rothstein (1995), MusicXML Good (2021), ABC Notation Walshaw (2021), and LilyPond Nienhuys and Nieuwenhuizen (2003) represent pitch, rhythm, and dynamics in text or event-based form and are widely used in generating and editing text-based musical scores. To make symbolic music more suitable for machine learning, various tokenization methods such as REMI Huang and Yang (2020), SMT-ABC Qu et al. (2024), and Octuple Zeng et al. (2021) encode attributes like pitch, duration, and timing into sequences of tokens.
2.2 Music Generation Models
Music generation is a major part of computational music research. Based on the underlying representation, recent music generation models fall into two categories:
**Audio Music Generation Models** have progressed from transformer-based models like MusicLM Agostinelli et al. (2023) and MusicGen Copet et al. (2024) to diffusion-based models like Noise2Music Huang et al. (2023a), Moûsai Schneider et al. (2024), AudioLDM2 Liu et al. (2024a) and ERNIE-Music Zhu et al. (2023). These models can generate good-quality music from textual descriptions. More recent developments include commercial services like Suno (https://suno.com/) along with open-source models such as YuE Yuan et al. (2025), SongGen Liu et al. (2025b), ACE-Step Gong et al. (2025) and DiffRhythm Ning et al. (2025), which can generate full-length songs with vocals coordinated with the lyrics.
Symbolic Music Generation Models focus on producing musical scores in formats like MIDI or ABC notation and are capable of generating multi-instrument compositions. Because these representations are textual, however, such models cannot produce realistic vocals. Symbolic music generation is particularly useful for music composition, understanding, and editing. Symbolic generation models have also improved significantly, moving from GANs (MuseGAN Dong et al. (2018)) and transformers (Museformer Yu et al. (2022)) to diffusion-based models like SD-Muse Zhang et al. (2023a).
2.3 Datasets and Benchmarks
Music datasets can be grouped into two categories:
Symbolic Music Datasets contain musical scores in formats like MIDI, MusicXML or ABC notation and are sometimes paired with their corresponding audio. MIDI datasets are the most popular; examples include Lakh MIDI Dataset Raffel (2016), MAESTRO Hawthorne et al. (2018), POP909 Wang et al. (2020) and Million-MIDI Dataset (MMD) Zeng et al. (2021). ABC notation datasets such as the Nottingham dataset (https://ifdo.ca/~seymour/nottingham/nottingham.html) and Textune Wu and Sun (2022) have also become popular because they are easier to read and edit.
Audio Music Datasets consist of raw audio recordings with additional metadata and are commonly used for tasks such as music generation, classification, and transcription. Notable datasets include MusicCaps Agostinelli et al. (2023), MusicBench Melechovsky et al. (2023) and MuLaMCap Huang et al. (2023a), which provide music clips with descriptive captions and are usually used in tasks like music generation, music captioning and retrieval. The GTZAN dataset Sturm (2013) is commonly used for genre classification, and FMA Defferrard et al. (2016) for music tagging.
2.4 Popular Musical Understanding Tasks
Figure 1 illustrates music-related tasks with their corresponding evaluation metrics. Besides generation, computational music research covers a variety of downstream music understanding tasks, briefly discussed below:
Music Information Retrieval (MIR) covers tasks such as key and tempo estimation, genre and style classification, beat detection, chord estimation, and instrument identification Raffel et al. (2014). The MARBLE benchmark Yuan et al. (2023) provides a standardized evaluation for 18 such MIR tasks.
Music Question Answering involves answering music-related questions based on symbolic or audio input (Deng et al., 2023; Liu et al., 2024b). These models are assessed using common text-generation metrics such as BLEU, METEOR, and ROUGE-L, and sometimes human evaluation Melechovsky et al. (2023) or LLM-based scoring Gardner et al. (2023).
Music Captioning deals with generating textual descriptions of audio (Gardner et al., 2023; Deng et al., 2023) using joint audio-text representations. Its evaluation metrics are largely the same as those for Music Question Answering, since both tasks produce free-form text.
Music Retrieval and Recommendation also works with joint audio-text representations, retrieving relevant audio or symbolic music from textual prompts (Wu et al., 2023b; Manco et al., 2022), and usually relies on ranking metrics such as Recall@K, HR@K, and MAP.
Music Agents such as MusicAgent Yu et al. (2023), ComposerX Deng et al. (2024), and Loop Copilot Zhang et al. (2023b) are autonomous systems that integrate multiple AI models to perform diverse music-related tasks.
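The ranking metrics named above for retrieval and recommendation can be illustrated with a minimal sketch. The function names and the toy track ids are hypothetical and not taken from any cited system; the definitions follow the standard textbook forms of Recall@K, hit rate (HR@K), and mean average precision (MAP).

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def hit_rate_at_k(ranked, relevant, k):
    """1.0 if at least one relevant item is in the top-k results, else 0.0."""
    return 1.0 if set(ranked[:k]) & set(relevant) else 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values measured at the rank of each relevant hit."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(queries):
    """queries: list of (ranked, relevant) pairs, one per text prompt."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Toy example: track ids returned for two prompts (hypothetical data).
queries = [
    (["t3", "t1", "t7", "t2"], {"t1", "t2"}),
    (["t5", "t9", "t4", "t8"], {"t5"}),
]
print(mean_average_precision(queries))
```

In practice Recall@K and HR@K are also averaged over all queries, in the same way MAP averages the per-query average precision.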
3 Music Generation Evaluation
Evaluation of generated music can be broadly divided into two categories: (1) subjective evaluation via human judgment and listening tests, and (2) automatic objective evaluation using computational metrics. This section first reviews objective evaluation methods and their shortcomings, then human evaluation methods, and finally ongoing efforts in benchmark development for evaluation.
3.1 Automatic Objective Evaluation
Automatic objective evaluation encompasses computational methods for assessing generated music. As shown in Figure 2, it includes both reference-based evaluation, which compares generated outputs to ground-truth references across audio and symbolic modalities, and reference-free evaluation, which assesses the generation’s quality and structure on its own.
3.1.1 Reference Based Evaluation
Reference-based metrics help assess the extent to which the generation is similar to the target reference. As audio music and symbolic scores are of different modalities (signal and text), their evaluation metrics vary as well and are discussed separately for better clarity.
Audio Similarity Evaluation: Most commonly used audio similarity metrics like KLD (Kullback-Leibler divergence) and FAD (Fréchet Audio Distance) assess how well generated audio matches a target distribution. KLD measures the difference between two probability distributions. In music evaluation, these distributions are often derived from the outputs of pretrained audio classifiers (like PANNs Kong et al. (2020) and PaSST Koutini et al. (2021)) or features Chen et al. (2024), allowing KLD to capture a high-level semantic similarity between generated and reference audio sets.
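As a minimal sketch of the KLD computation described above, the snippet below compares class-probability vectors such as those a pretrained audio tagger would output. The probability values are made up, and pairing each reference clip with one generated clip and averaging is an assumption of this sketch; individual papers may aggregate differently.

```python
import math


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) in nats for two discrete distributions over the same labels."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_pairwise_kld(ref_probs, gen_probs):
    """Average KL(ref || gen) over paired reference/generated clips."""
    pairs = list(zip(ref_probs, gen_probs))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Hypothetical classifier posteriors over three tags for two clip pairs.
ref = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
gen = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]
print(mean_pairwise_kld(ref, gen))
```

Lower values mean the generated clips trigger the classifier in a way closer to the reference clips; a value of zero is reached only when the two posteriors coincide.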
On the other hand, FAD Kilgour et al. (2018) evaluates whether generated audio is plausible and clean by comparing its distribution to a background dataset: embeddings are extracted with a pretrained audio classifier, and the Fréchet distance between the two embedding distributions is measured. Even though FAD is widely used, its effectiveness depends on the choice of audio classifier (Huang et al., 2023a; Tailleur et al., 2024) and on reference set quality (Gui et al., 2024). It also assumes that the audio feature embeddings follow a Gaussian distribution, which is often false for real-world audio, whose feature distributions can be complex and non-Gaussian Chung et al. (2025). Which audio classifier and reference set are appropriate for FAD is still debated Gui et al. (2024); Lee et al. (2024); Evans et al. (2025).
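The Fréchet distance between two Gaussians fitted to embedding sets has the closed form ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). The following is a minimal NumPy sketch of that formula, not the reference FAD implementation; the random arrays stand in for classifier embeddings and are purely illustrative.

```python
import numpy as np


def frechet_distance(emb_ref, emb_gen):
    """Frechet distance between Gaussians fitted to two (n_samples, dim) arrays."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite covariances (clipped here against rounding noise).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    return mean_term + float(np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


# Illustrative stand-in for embeddings of a reference and a generated set.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 4))
generated = reference + np.array([1.0, 0.0, 2.0, 0.0])  # same shape, shifted mean
print(frechet_distance(reference, generated))
```

Because only the mean and covariance enter the formula, any structure in the embeddings beyond second-order statistics is invisible to the score, which is the Gaussian assumption criticized above.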
Larger reference sets yield more stable and accurate FAD scores, while small ones cause biased estimates due to poor statistical representation. To correct this, FAD∞ Gui et al. (2024) was proposed, which approximates FAD as if computed with an infinite-sized reference set.
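One way such an infinite-sample estimate can be obtained is by extrapolation: compute the score on subsets of increasing size N, fit a straight line against 1/N, and read off the intercept at 1/N = 0. The sketch below shows this idea under the assumption that the bias is roughly linear in 1/N; it is an illustration of the extrapolation principle, not a reproduction of the cited procedure, and the scores are synthetic.

```python
def extrapolate_to_infinite_n(sample_sizes, scores):
    """Least-squares line of score vs. 1/N; the intercept estimates the score as N grows without bound."""
    xs = [1.0 / n for n in sample_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    return mean_y - slope * mean_x


# Synthetic scores that follow score(N) = 2.0 + 50 / N exactly (hypothetical values).
sizes = [100, 200, 400, 800]
scores = [2.0 + 50.0 / n for n in sizes]
print(extrapolate_to_infinite_n(sizes, scores))
```

With real data the points scatter around the line, so several random subsets per size are typically averaged before fitting.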
To address the limitations of FAD, newer metrics such as KAD (Kernel Audio Distance) Chung et al. (2025) and MAD (MAUVE Audio Divergence) Huang et al. (2025) were recently proposed. KAD uses Maximum Mean Discrepancy (MMD) to compare distributions without assuming a Gaussian distribution, making it more reliable with small sample sizes. MAD also avoids the Gaussian assumption, using self-supervised MERT embeddings and k-means clustering to better capture complex distributions. Both KAD and MAD have shown better correlation with human preferences than FAD, which indicates that research efforts are being made toward more perceptually relevant objective evaluation metrics.
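The Maximum Mean Discrepancy underlying KAD can be sketched as follows. This is the standard unbiased estimator of squared MMD with a Gaussian kernel; the kernel choice, the bandwidth value, and the toy two-dimensional "embeddings" are assumptions of this sketch rather than the settings of the cited work.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of squared MMD between two sets of vectors (each needs at least 2 items)."""
    m, n = len(xs), len(ys)
    k_xx = sum(
        gaussian_kernel(xs[i], xs[j], bandwidth)
        for i in range(m) for j in range(m) if i != j
    ) / (m * (m - 1))
    k_yy = sum(
        gaussian_kernel(ys[i], ys[j], bandwidth)
        for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    k_xy = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


# Toy "embeddings": one set near the origin, one nearby, one far away.
reference = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
near = [[0.05, 0.05], [0.1, 0.1], [0.0, 0.05]]
far = [[10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
print(mmd_squared_unbiased(reference, near), mmd_squared_unbiased(reference, far))
```

No mean or covariance is fitted here, so no Gaussian assumption is made; the unbiased estimator can be slightly negative when the two sets are very similar.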
Symbolic Score Similarity Evaluation: Evaluating symbolic music is less standardized than audio evaluation, with many works defining their own metrics and using various symbolic representations. The most common framework, proposed by Yang and Lerch (2020), uses Overlapped Area (OA) and Kullback-Leibler Divergence (KLD) to compare pitch and rhythm feature distributions between generated and reference sets. OA and KLD indicate whether, and to what extent, the features of the generated set resemble those of the reference set. While useful, OA computes feature histograms over the entire sequence and therefore fails to account for temporal order. To address this, Macro Overlapped Area (MOA) von Rütte et al. (2022) was introduced to incorporate temporal order. Additional, less common similarity-based metrics are listed in Figure 2.
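The idea behind an overlapped-area score can be shown with a simplified histogram version: normalize a feature histogram for each set and sum the bin-wise minima. Using raw pitch-class histograms instead of smoothed densities is a simplification assumed for this sketch, and the note lists are invented.

```python
from collections import Counter


def normalized_histogram(values, bins):
    """Relative frequency of each bin label; values outside `bins` are ignored."""
    counts = Counter(values)
    total = sum(counts[b] for b in bins)
    return [counts[b] / total if total else 0.0 for b in bins]


def overlapped_area(p, q):
    """Sum of bin-wise minima of two normalized histograms (1.0 = identical, 0.0 = disjoint)."""
    return sum(min(a, b) for a, b in zip(p, q))


# Pitch classes (0 = C, ..., 11 = B) of a reference set and a generated set (hypothetical).
pitch_classes = list(range(12))
reference = normalized_histogram([0, 0, 4, 7], pitch_classes)
generated = normalized_histogram([0, 4, 4, 9], pitch_classes)
print(overlapped_area(reference, generated))
```

Because the histograms pool all notes of a set, shuffling the notes in time leaves the score unchanged, which is the temporal-order limitation noted above.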
Originality Evaluation: A critical task is ensuring that generative models produce novel content rather than simply copying their training data. Earlier methods used pattern matching like n-grams Hakimi et al. (2020), longest common subsequence Chu et al. (2016) and cardinality-based similarity scores Yin et al. (2021) to detect overfitting. Recent approaches used exact and approximate semantic token matches Agostinelli et al. (2023) and embedding-based methods, such as LAION-CLAP Wu et al. (2023c) embeddings, to identify repeated audio segments, which are then verified through manual listening Evans et al. (2024, 2025).
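A pattern-matching originality check of the n-gram kind can be sketched as the share of a generated sequence's n-grams that also occur in the training data. The function names, the choice of n = 4, and the MIDI pitch sequences are illustrative assumptions, not the setup of any cited work.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token tuples in a sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copied_ngram_ratio(generated, training_corpus, n=4):
    """Fraction of distinct generated n-grams that also occur in any training piece."""
    generated_grams = ngrams(generated, n)
    if not generated_grams:
        return 0.0
    seen = set()
    for piece in training_corpus:
        seen |= ngrams(piece, n)
    return len(generated_grams & seen) / len(generated_grams)


# Hypothetical MIDI pitch sequences.
training = [[60, 62, 64, 65, 67, 69]]
generated = [60, 62, 64, 65, 70, 72]
print(copied_ngram_ratio(generated, training))
```

A ratio close to 1 suggests heavy copying, but exact matching misses transposed or rhythmically varied copies, which is one reason approximate and embedding-based matching is also used.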
3.1.2 Reference Free Evaluation
Reference-free metrics evaluate a generation without comparing it to a reference, assessing two key dimensions: 1) the quality of the generation on its own and 2) its adherence to user instructions.
Music Quality Evaluation: Music quality evaluation includes both theoretical and perceptual quality evaluation. It involves assessing the structural integrity of the composition based on music theory as well as evaluating whether the music is aesthetically pleasing and emotionally impactful to listeners.
**Automatic Audio Quality Evaluation:** Even though there is no defined way to quantify audio quality, standalone metrics for perceived audio quality are constantly being developed. Inception Score Salimans et al. (2016) is used to assess quality and diversity but can be misleading if a model overfits on its training data Donahue et al. (2018b).
PAM Deshmukh et al. (2024) assesses overall audio quality without a reference by using an audio-language model to detect distortions and artifacts, comparing an audio sample against contrasting text prompts ("clear sound" vs. "noisy sound"). Audiobox Aesthetics Tjandra et al. (2025) is a domain-agnostic model trained on 97,000 annotated clips to predict four distinct and interpretable aesthetic dimensions: Production Quality, Production Complexity, Content Enjoyment and Content Usefulness. The latest trend involves training aesthetic predictors (Yao et al., 2025) directly on large-scale human preference datasets (Huang et al., 2025; Liu et al., 2025a; Yao et al., 2025). Human preference datasets mainly contain generated songs annotated with human preference ratings (details of the datasets are discussed in Section 3.3). Even though newer works (Yuan et al., 2025; Zhang et al., 2025; Gong et al., 2025) have quickly started to adopt Audiobox Aesthetics in their evaluation, Yao et al. (2025) showed that models trained on their human preference dataset, SongEval, outperform Audiobox Aesthetics in predicting human-perceived musical quality.
Symbolic Score Quality Evaluation: Symbolic score quality evaluation is less advanced than audio quality evaluation. It typically involves manual or rule-based analysis to assess the structural correctness of the score and the quality of the features.
A) Symbolic Score Structure Evaluation: These metrics check whether the generation maintains a proper structure and adheres to music theory. Checking for irregularity in chords Wu et al. (2023a), rhythmic consistency Lu et al. (2023), and scale consistency Mogren (2016) are some ways to check feature-wise structure in generations, but the use of these metrics is not standardized. Empty Bars (EB), the ratio of empty bars, and Format Correctness Evaluation Yuan et al. (2024) are used to measure syntactic accuracy. Some works (Yuan et al., 2024; Wu and Yang, 2020; Chuan and Herremans, 2018; Lattner et al., 2018; Chen et al., 2019) checked for repeating patterns in the generated score, as these can indicate music-like structure.
B) Feature Quality Evaluation: These metrics are feature-heavy and may provide some insight into the quality of specific musical features, but they are in no way sufficient to quantify overall music quality. Figure 2 lists the metrics used for checking the quality of Chords, Pitch and Note, and Rhythm. Visualization tools such as spectrograms of generated waveforms Zhu et al. (2023), Constant-Q Transform spectrograms Engel et al. (2017), pianorolls Dong et al. (2018), keyscapes Lattner et al. (2018), and fitness scape plots Müller and Jiang (2012) can be used to assess feature quality visually.
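The Empty Bars ratio mentioned under structure evaluation is simple enough to state directly. The sketch below assumes a score has already been parsed into a list of bars, each holding its note events; that data layout and the example values are hypothetical.

```python
def empty_bar_ratio(bars):
    """Share of bars that contain no note events.

    bars: list of bars, each a list of note events; an empty list is a silent bar.
    """
    if not bars:
        return 0.0
    return sum(1 for bar in bars if not bar) / len(bars)


# Four bars of MIDI pitches, two of them silent (hypothetical score).
score = [[60, 64, 67], [], [62, 65], []]
print(empty_bar_ratio(score))
```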
**Adherence to Instruction Evaluation:** Adherence to Instruction Evaluation measures how well generated music aligns with input directives, which can be textual prompts or structured controls like lyrics, chords or style, ensuring the output faithfully reflects the intended guidance.
**Adherence to Textual Prompts Evaluation:** For text-to-music models, adherence to textual prompts is typically measured by computing the cosine similarity between the text embedding of the prompt and the audio embedding of the generation. The CLAP Score Huang et al. (2023b); Evans et al. (2024), where embeddings are derived from CLAP (Elizalde et al., 2023; Wu et al., 2023c) models, is common, but CLAP is not a music-specific model. Alternatives such as MuLan embeddings, MuQ-MuLan Zhu et al. (2025) and the CLAMP 3 model Wu et al. (2025) showed better performance because they are trained on more music-aware tasks and larger datasets (Agostinelli et al., 2023; Gong et al., 2025; Yuan et al., 2025).
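The cosine-similarity step behind such text-audio alignment scores can be sketched independently of any particular encoder. In the snippet below the embedding vectors are made-up placeholders; in a real evaluation they would come from a joint text-audio model such as those named above, whose loading code is deliberately not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_text_audio_alignment(text_embeddings, audio_embeddings):
    """Average cosine similarity over (prompt, generated clip) embedding pairs."""
    pairs = list(zip(text_embeddings, audio_embeddings))
    return sum(cosine_similarity(t, a) for t, a in pairs) / len(pairs)


# Placeholder 3-d embeddings for two prompts and their generations.
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
audio_embeddings = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(mean_text_audio_alignment(text_embeddings, audio_embeddings))
```

The score is only as meaningful as the shared embedding space, which is why the choice of encoder matters for music.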
Adherence to Lyrics: Phoneme Error Rate (PER) is used to check how well the generated audio follows the given lyrics. PER is calculated by extracting the vocal track and passing it to a lyrics recognition model Lei et al. (2025). Sheng et al. (2021) evaluated the alignment accuracy of melody and lyrics to ensure structural consistency.
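Once the recognized phonemes are available, PER is commonly computed as the edit distance between the reference and recognized phoneme sequences divided by the reference length. The sketch below implements that standard definition; the phoneme labels are hypothetical, and vocal separation and recognition are assumed to have happened upstream.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance (substitutions, deletions, insertions) between two sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def phoneme_error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference phonemes."""
    if not reference:
        raise ValueError("reference phoneme sequence must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Phonemes of the target lyric vs. phonemes recognized from the generated vocals.
target = ["HH", "AH", "L", "OW"]
recognized = ["HH", "AH", "L", "UW"]
print(phoneme_error_rate(target, recognized))
```

Because insertions count as errors, PER can exceed 1.0 when the recognizer outputs many more phonemes than the lyric contains.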
Adherence to Other Control Inputs: Control inputs for symbolic music generation other than textual descriptions can affect the selection of evaluation metrics. Some works (Wu et al., 2024; Melechovsky et al., 2023; Yeh et al., 2021; Ren et al., 2020) evaluated the fine-grained feature control ability of their models using a few of the feature-specific metrics listed in Figure 2, but these metrics are less common in the literature. Style or genre adherence is often evaluated using a dedicated classifier (Brunner et al., 2018b; Jin et al., 2020). Since classifier scores only indicate the presence of some distinguishing features rather than true stylistic conformity, Cífka et al. (2019) proposed a more interpretable style fit metric to evaluate stylistic alignment. In emotion-controlled generation, discriminator models have been used to classify whether a generated piece belongs to the intended emotional category Imasato et al. (2023).
Appendix B lists some metric definitions and Appendix C mentions currently available evaluation toolkits, both of which were omitted from the main text due to space constraints.
3.2 Human Evaluation
Since there is still no clear method to assess creativity and musical quality, most music generation evaluations rely on human judgment for validation. Human evaluation involves designing listening experiments with meaningful assessment criteria, suitable participants, and an appropriate environment to qualitatively evaluate generated music. In comparison-based listening tests, listeners are asked to compare two or more samples. This can take the form of a Turing test, where the goal is to distinguish between human-composed and AI-generated music (Lee et al., 2022; Donahue et al., 2018a, 2019), or a preference test asking which sample is of higher quality (Deng et al., 2024; Hawthorne et al., 2018).
Beyond comparison, participants rate generated music on one or more criteria, typically using a Likert scale Huang et al. (2023b) or ratings that are averaged into a Mean Opinion Score (MOS) Liu et al. (2025a). Assessment criteria are much less standardized, as works (Melechovsky et al., 2023; Jin et al., 2020) usually define their own, but they can be broadly grouped into the following evaluation aspects:
- Musical structure according to music theory assesses how well the audio follows logical and theory-aligned musical organization.
- Music quality captures aspects like creativity, harmonic richness, and emotional impact.
- Adherence to instruction measures how accurately the output reflects the given prompt.
- Quality of vocals evaluates the attractiveness and harmonic integration of vocals in the audio.
Figure 2 lists the assessments typically used in human evaluation, and Appendix A discusses their definitions. A listening test design can also be task-specific; for example, Jin et al. (2020) conducted a listening test to evaluate classical music generation and defined their own assessment criteria based on the characteristics of classical music only. Suzuki et al. (2023) used OpenAI’s ChatGPT and Google’s Bard to assess the atmosphere and genre of the generated music, alongside a human evaluation on the same criteria. Hypothesis tests such as the Kruskal-Wallis H test, the Wilcoxon signed-rank test, and t-tests are used to validate the statistical significance of the human ratings (Donahue et al., 2019; Hawthorne et al., 2018).
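For rating-based tests, a Mean Opinion Score is usually reported together with an uncertainty estimate. The following minimal sketch averages 1-5 ratings for one system and attaches a normal-approximation 95% confidence interval; the use of z = 1.96 and the ratings themselves are assumptions of the example, and small panels may call for a t-based interval instead.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, (lower, upper)) using a normal-approximation confidence interval."""
    mos = statistics.mean(ratings)
    if len(ratings) < 2:
        return mos, (mos, mos)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, (mos - half_width, mos + half_width)


# Ratings from five listeners for one generated clip (hypothetical).
ratings = [4, 5, 3, 4, 4]
mos, (lower, upper) = mean_opinion_score(ratings)
print(mos, lower, upper)
```

Overlapping intervals between two systems are a hint, not a proof, that their difference is not significant; the hypothesis tests named above address that question directly.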
3.3 Benchmarks
MusicCaps Agostinelli et al. (2023), MusicBench Melechovsky et al. (2023) and Song Describer Dataset Manco et al. (2023) are often used to evaluate text-to-audio music (TTM) generation models (https://paperswithcode.com/task/text-to-music-generation#datasets) (Evans et al., 2024, 2025). Ziqi-Eval’s music generation question set Li et al. (2024) offers 184 multiple-choice and 200 five-shot questions to test LLMs on melody continuation, which technically assesses music understanding rather than generation capabilities. Several human preference datasets have also been proposed. MusicPrefs Huang et al. (2025) contains 183,000 clips with crowdsourced pairwise ratings for fidelity and musicality. Dynamo Music Aesthetics (DMA) Bai et al. (2025) includes 800 prompts, 1,676 pieces (15.97 hours) and 2,301 detailed 1–5 ratings from 63 raters. MusicEval Liu et al. (2025a) contains 2,748 clips from 31 TTM models with over 13,000 expert ratings for musical impression and text alignment. SongEval Yao et al. (2025) is a large-scale benchmark of 2,399 songs (140+ hours), rated by 16 professionals across five dimensions: coherence, memorability, naturalness, clarity and musicality.
4 A Critical Review
In this section, we present some critical analyses of the current music generation evaluation metrics, followed by identifying research gaps and pathways for future research to overcome them.
4.1 A Critical Analysis of Objective Metrics
Limitations of Similarity-Based Metrics: High scores on similarity-based metrics do not guarantee high-quality or musically meaningful compositions. Similarity with the target distribution simply means that generated scores show characteristics similar to the reference set; it in no way quantifies whether the piece itself sounds good or sounds out of tune and boring. Unless the generation is controlled, syntactic similarity metrics like BLEU, Average Sample-wise Accuracy and Chord Matchness offer little insight for the same reason. Assessing only the similarity with the reference leads to an incomplete evaluation and should be accompanied by reference-free music quality evaluation.
Lack of Interpretation: Yuan et al. (2025) showed that many widely used objective metrics, such as CLAP score, FAD, and KLD, often align poorly with human preferences, which makes the conclusions of prior studies that rely on these measures questionable. A core issue is that these metrics lack clear interpretability. For example, OA is read as the higher the better and KLD as the lower the better, but neither has a meaningful threshold or guidance for balancing similarity and originality. Similarly, Chord Progression Irregularity Wu et al. (2023a) measures the percentage of unique chord trigrams, where lower values suggest greater stability, yet extremely low values can also indicate a boring sequence. While these scores can rank models and indicate feature quality, better-scoring outputs may not sound better to listeners. Overall, objective metrics alone cannot reliably evaluate musical quality and, without human evaluation, risk misrepresenting what truly sounds good.
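The interpretability problem can be made concrete with the chord trigram measure described above. The sketch computes the share of unique chord trigrams in a progression, following the description in the text; the function name and the two progressions are illustrative.

```python
def chord_progression_irregularity(chords, n=3):
    """Share of unique chord n-grams (trigrams by default) among all n-grams."""
    grams = [tuple(chords[i:i + n]) for i in range(len(chords) - n + 1)]
    if not grams:
        return 0.0
    return len(set(grams)) / len(grams)


# A looping four-chord progression vs. a single repeated chord (hypothetical).
looping = ["C", "G", "Am", "F"] * 2
static = ["C"] * 5
print(chord_progression_irregularity(looping), chord_progression_irregularity(static))
```

The single repeated chord scores lower, and therefore nominally "more stable", than the four-chord loop, even though most listeners would find it the less interesting of the two; the number alone does not say where stability ends and monotony begins.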
Lack of Cross-cultural Consideration: A significant limitation in music generation evaluation arises from cross-cultural biases in datasets, benchmarks, and evaluation methods. Mehta et al. (2025) quantified the severe Western-centric bias in 152 papers proposing music datasets, finding that only 5.7% of the music comes from non-Western genres, including South Asian, Middle Eastern, Oceanian, Central Asian, Latin American and African music combined. Models trained on these datasets struggle to generate low-resource genres, while evaluation metrics tailored to Western styles may fail to assess diverse musical characteristics and lack suitable training data. For example, FAD’s reliance on the choice of reference set and audio embeddings raises concerns that it may favor only well-resourced genres. Another example is Pitch Histogram Entropy, where high entropy suggests unstable tonality with more scattered pitch classes; this reading may suit straightforward genres like pop but is ill-suited for evaluating microtonal, polyrhythmic or improvisational music from low-resource traditions. Similarly, corpus-based evaluations favor well-documented styles while overlooking culturally unique ones. Wilson et al. (2025) further highlighted limited transparency, with few models disclosing training data or generation methods, hindering efforts to address these biases.
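The Pitch Histogram Entropy example can be illustrated with a short sketch that folds MIDI pitches into 12 pitch classes and computes the Shannon entropy of the resulting histogram. Folding to 12 equal-tempered classes is itself an assumption baked into the metric, and the note lists are invented.

```python
import math
from collections import Counter


def pitch_class_entropy(midi_pitches):
    """Shannon entropy (in bits) of the 12-bin pitch-class histogram."""
    if not midi_pitches:
        return 0.0
    counts = Counter(pitch % 12 for pitch in midi_pitches)
    total = len(midi_pitches)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# The same pitch class in three octaves vs. a full chromatic run (hypothetical).
single_class = [48, 60, 72]
chromatic = list(range(60, 72))
print(pitch_class_entropy(single_class), pitch_class_entropy(chromatic))
```

The 12-bin folding cannot represent microtonal pitches at all, so a tradition that relies on them is distorted before the entropy is even computed, independent of how the resulting value is interpreted.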
Lack of Standardization: While feature-specific metrics can be useful for analyzing individual systems, they often fail to generalize, with many researchers adopting their own evaluation criteria. The resulting overwhelming number of specialized metrics makes it difficult to determine which are truly effective, hindering clear assessment and comparison of different generative models’ strengths and weaknesses.
Limitations of Music Quality/Aesthetic Predictors: Driven by efficiency and the growing need for large-scale evaluation, recent works are shifting towards automatic music quality evaluation using aesthetic predictors like Audiobox Aesthetics Tjandra et al. (2025). Unfortunately, Zhang et al. (2025) highlighted that human preference datasets often misalign with these independently trained aesthetic predictors. This indicates that human preference is not a single, consistent concept, as human perception of creativity is subjective and shaped by geography, history, and culture Lubart (1999). Different evaluation methods, even if both are based on human feedback, can lead to contradictory conclusions about music quality, raising concerns about their reliability and generalizability. Zhang et al. (2025) further showed that aesthetic predictors favor certain content, such as tracks featuring a "punchy kick" or "synth".
Limitations of Human Preference Datasets: Human preference datasets can introduce bias, as they are constructed from audio generated by current TTM models, which often fail on low-resource genres as well. Furthermore, most human preference datasets rely solely on overall impression Huang et al. (2025); Liu et al. (2025a) or preference Bai et al. (2025), which is insufficient for modeling human perception of musical creativity. The judgment of creativity needs to be broken down further into several equally weighted dimensions, with experts rating audio across these dimensions. As a simple analogy, despite individual differences in taste, expert food critics evaluate dishes across equally important dimensions like flavor, texture, presentation, originality, execution, and overall impression. Among the human preference datasets, SongEval Yao et al. (2025) broke music quality evaluation into multiple dimensions, but further analysis with music experts is needed to ensure that the dimensions cover all aspects of quality evaluation.
**Limitations of Symbolic Music Evaluation:** Symbolic music generation evaluation lags behind audio music evaluation in both standardization and depth of analysis, largely because symbolic representations lack direct perceptual grounding. While audio evaluation is heading towards perceptual and embedding-based metrics that can align well with human perception, symbolic evaluation often relies on simplistic feature-based measures that might miss important aspects of music quality and creativity and may not correlate with human perception. Furthermore, symbolic evaluation lacks standard benchmarks, representations and validated features, making it hard to compare models or ensure that metrics generalize across styles.
4.2 A Critical Analysis of Human Evaluation
Sensitivity to Participant Background: Designing a listening test can be challenging because the results are highly sensitive to factors such as variation in participant expertise and uneven participant group sizes, as well as biases due to age, education, cultural exposure, and cognitive traits (Ferreira et al., 2023; Yang and Lerch, 2020). These biases become more likely when participants come from a single background, limiting generalizability. Ferreira et al. (2023) conducted a blind listening test with 117 participants from diverse backgrounds to evaluate their ability to distinguish between AI-generated and human-composed music. Results showed that frequent classical music listeners, musicians, and individuals with high self-assessed musical sensitivity were significantly more accurate in identifying the source, highlighting the need to recruit raters with appropriate musical background and perceptual skill.
Experimental Design Challenges: Aside from participant expertise, the design of listening tests is not standardized either; factors to consider include sample selection, the listening environment, and the phrasing of the surveys. Variations in environment, confusing survey phrasing, and small sample sizes reduce statistical reliability. For example, in their listening test, Schneider et al. (2024) defined musicality as how melodious and harmonious the given sound is, whereas Yuan et al. (2024) defined musicality based on two aspects: the overall consistency of the music in terms of melodic patterns, chord progressions, etc., and the presence of a clear structural development with respect to features. Because each work designs its own listening test criteria and large-scale studies are costly, standardized cross-model comparison remains difficult Yang and Lerch (2020).
4.3 Summary of Major Limitations
Among the challenges in music generation evaluation discussed in the previous section, several stand out as particularly critical. The lack of interpretability and reliability of objective metrics undermines the ability to draw meaningful conclusions from an evaluation, as widely used measures often misalign with human perception and lack clear thresholds for quality. The lack of cross-cultural consideration introduces severe biases by favoring Western music traditions in datasets and evaluation methods. The lack of standardization in evaluation methodologies also makes cross-model comparisons difficult. Finally, limitations in listening test design weaken the validity of listener studies intended to capture subjective musical quality. Together, these limitations call into question the credibility and inclusiveness of music generation evaluation methods and point to an urgent need for more interpretable, culturally aware, and standardized evaluation frameworks.
4.4 Research Gaps
This section identifies three open research questions in the music generation evaluation paradigm, each illustrating a distinct category of research gap. First, despite extensive study, the question “How to model and evaluate creativity in music? Does modeling human perception automatically model creativity?” remains unresolved, as existing methods struggle to deliver robust or generalizable solutions for capturing the subjective nature of creativity. Second, the question “Can the existing evaluation methods cater to underrepresented genres?” is currently understudied and calls for better evaluation methods for underrepresented genres. Third, “How can future efforts in music evaluation develop robust methodologies that effectively integrate computational analysis with listener perception studies and task-specific benchmarks?” represents an area yet to be systematically explored, one that requires the joint efforts of music experts and cognitive scientists to design a comprehensive evaluation framework.
4.5 Opportunities and Future Directions
We argue that a comprehensive music generation evaluation framework should have three components: 1) evaluating music quality and structure, 2) evaluating adherence to instruction, and 3) evaluating similarity with a reference (figure 3). Future efforts in music evaluation should focus on developing more robust and generalized evaluation methodologies that integrate computational analysis with listener perception studies, cross-cultural considerations, and task-specific benchmarks for these three components. We welcome the ongoing efforts to emulate human perception of music through automatic aesthetic predictors and human preference datasets for large-scale evaluation, but significant research effort is needed to break down the concept of music quality and structure into smaller, definable dimensions whose scores can jointly give us an interpretable way to quantify music quality, rather than depending only on ambiguous terms such as “overall quality”. We further propose a possible automatic music quality and structure evaluation framework that incorporates human-in-the-loop training and Reinforcement Learning Kaelbling et al. (1996) to rank the subjective quality of generated music according to human perception across scientifically defined dimensions. Starting from a model pretrained on such a human preference dataset, the model receives both original and generated songs as inputs and predicts scores across pre-defined dimensions of music quality. These predictions can be compared with expert ratings to compute a reward that is used to fine-tune the model. Through this feedback loop, the model can learn to align its predictions more closely with human perception. The key requirement is to involve experts from various cultural backgrounds and to use original songs, especially from low-resource genres, to make the aesthetic predictor less biased and more generalizable.
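As a rough illustration of the reward step in this feedback loop, the sketch below compares per-dimension predictions with expert ratings. The dimension names, the [0, 1] score range, and the choice of one minus the mean absolute error as the reward are all assumptions made for illustration; the text does not prescribe a specific reward function.

```python
from statistics import mean

def alignment_reward(predicted, expert):
    """Hypothetical reward: 1 minus the mean absolute error over shared dimensions.

    Both arguments map a dimension name to a score assumed to lie in [0, 1].
    """
    dims = sorted(predicted.keys() & expert.keys())
    if not dims:
        raise ValueError("no shared dimensions to compare")
    return 1.0 - mean(abs(predicted[d] - expert[d]) for d in dims)

# Illustrative dimension names and scores (not taken from any dataset).
predicted = {"coherence": 0.8, "rhythm": 0.5, "originality": 0.6}
expert = {"coherence": 0.6, "rhythm": 0.5, "originality": 0.9}
reward = alignment_reward(predicted, expert)
```

A reward of this kind is higher when the predictor agrees with the expert on every dimension, which is what the fine-tuning step would then reinforce.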
5 Conclusion
Evaluating music generation remains a complex challenge due to the inherent subjectivity of music, as we have yet to discover how to quantify human perception of creativity. Recent efforts to model human perception with automatic aesthetic predictors are at a very early stage, and further research with cognitive scientists and music experts is necessary to determine modular, interpretable evaluation dimensions that quantify the overall quality of a music piece. Furthermore, it is equally necessary to acknowledge and address the biases and lack of interpretability present in current music generation models and evaluation methodologies to make music generation more generalizable to the global music community.
Appendix A Human Evaluation
A.1 Musical Structure according to Music Theory
Structureness: If the music is structured nicely or not Liu et al. (2022). More fine-grained structural aspects were used by Yu et al. (2022). Short-term structure: Whether the generated score is showing good structures, good repetitions and reasonable development within a close range. Long-term structure: Whether the generated score is showing good structures, song-level repetitions and long distance connections within a broader range.
**Correctness:** Whether the listener perceives the piece as free of composing or playing mistakes Hsiao et al. (2021).
**Fluency:** If the generated music sounds fluent or not Zhang (2020).
**Arrangement:** Are the instruments used reasonably and arranged properly? Dong et al. (2023).
Rhythm consistency: Is the rhythm staying constant throughout the music? Melechovsky et al. (2023)
**Audio Rendering Quality:** To check the audio rendering quality for generated audio Melechovsky et al. (2023).
Audio clarity: How close the quality is to a walkie-talkie (worst) or a high-quality studio sound system (best) Schneider et al. (2024).
Style/Genre Analysis: If the generated music can be classified to any genre Mao et al. (2018).
Coherence: Do the music lines sound coherent or not? Liu et al. (2022)
Orchestration: Is the score nicely orchestrated Liu et al. (2022)
A.2 Music Quality
**Rhythm:** If the note durations and pauses of the melody sound natural or not Sheng et al. (2021).
Diversity/Richness: How diverse and interesting is the generated musical score Liu et al. (2022), Wu and Yang (2020).
**Impression:** Does the listener remember some part of the melody? Wu and Yang (2020).
Humanness: Does the piece resemble expressive human performances? Hsiao et al. (2021)
Chord Progression: Assesses how coherent, pleasant, or reasonable the progression is on its own, independent of melody Harte et al. (2006).
Harmonicity: Measures how well the progression harmonizes with a given melody Harte et al. (2006).
Interestingness: Evaluates how exciting, unexpected, or positively stimulating the progression sounds. These three criteria (Chord Progression, Harmonicity, and Interestingness) were used to assess models for the melody harmonization task.
Emotionality: How the emotion is perceived in the generated score. Evaluators were asked to place the perceived emotion of each piece on Russell’s circumplex model of affect Imasato et al. (2023).
Innovativeness: Originality in style, timbre, and structural elements
A.3 Adherence to Instruction
Semantic Matching Degree (SMD): How well the generated music matches the expressiveness described by the input text Wang et al. (2024).
Controllability: How well the score is adhering to the musical attributes specified in given prompt/text description Lu et al. (2023).
Music Chord Match and Music Tempo Match: measure to what extent the chords and tempo of the generated music match the text prompt, respectively Melechovsky et al. (2023).
To evaluate lyrics generated from a given melody, and vice versa, the following metrics can be used:
**Listenability:** Does the lyric sound natural with the melody? Sheng et al. (2021)
Grammaticality: Is the lyric grammatically correct? Sheng et al. (2021)
**Meaning:** If the lyrics seem meaningful or not Sheng et al. (2021).
**Emotion:** If the emotion of the melody aligns with the lyrics or not Sheng et al. (2021).
Appendix B Evaluation Metrics Definition and Behaviour
Pitch and Rhythm Variations Trieu and Keller (2018) measure the number of unique pitches and note durations within a sequence, respectively.
Used Pitch Class (UPC) Dong et al. (2018) is the number of used pitch classes per bar.
Qualified Note (QN) Dong et al. (2018) is the proportion of notes that are at least three time steps long (equivalent to a 32nd note or longer). This metric indicates whether the music is too fragmented, with a higher QN suggesting smoother, continuous music.
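The UPC and QN definitions above can be sketched as follows. The note representation (tuples of MIDI pitch, onset step, and duration in time steps), the `steps_per_bar` parameter, and the choice to assign each note to the bar of its onset are assumptions made for this illustration.

```python
def used_pitch_classes_per_bar(notes, steps_per_bar):
    """Map each bar index to the number of distinct pitch classes starting in it."""
    bars = {}
    for pitch, start, _duration in notes:
        bars.setdefault(start // steps_per_bar, set()).add(pitch % 12)
    return {bar: len(classes) for bar, classes in sorted(bars.items())}

def qualified_note_ratio(notes, min_steps=3):
    """Proportion of notes lasting at least `min_steps` time steps."""
    if not notes:
        return 0.0
    qualified = sum(1 for _pitch, _start, duration in notes if duration >= min_steps)
    return qualified / len(notes)

# (pitch, onset step, duration in steps); 16 steps per bar is an assumption.
notes = [(60, 0, 4), (64, 4, 2), (72, 8, 6), (67, 16, 1), (67, 20, 8)]
upc = used_pitch_classes_per_bar(notes, steps_per_bar=16)
qn = qualified_note_ratio(notes)
```

In this toy example the first bar uses two pitch classes (C and E, since 60 and 72 share a class), and three of the five notes are long enough to count as qualified.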
Drum Pattern (DP) Dong et al. (2018) is the ratio of notes in 8 or 16-beat patterns. The authors suggested that Rock songs frequently use a 4/4 beat pattern.
Tonal Distance (TD) Harte et al. (2006) measures harmonicity between two sequences, where a higher tonal distance (TD) indicates weaker harmonic alignment between them.
Qualified Rhythm Frequency Trieu and Keller (2018) extends the Qualified Note metric of Dong et al. (2018) (which excluded notes shorter than a 32nd note) by measuring how often note durations match standard values (1, 1/2, 1/4, 1/8, 1/16) including dotted, triplet, and tied forms.
Consecutive Pitch Repetitions (CPR) Trieu and Keller (2018) measures the frequency of occurrences of some number of consecutive pitch repetitions. A high CPR represents monotonous repetition in generated music.
Durations of Pitch Repetitions (DPR) Trieu and Keller (2018) measures how often a pitch is repeated for at least some total duration, helping to detect long repetitions.
Tone Spans (TS) Trieu and Keller (2018) counts how often pitch changes exceed a tone distance d (in half-steps).
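As a minimal sketch of the Tone Spans count, assuming the melody is given as a monophonic list of MIDI pitch numbers (so that a difference of 1 equals one half-step) and that only consecutive notes are compared:

```python
def tone_span_count(pitches, d):
    """Count consecutive pitch changes larger than d half-steps."""
    return sum(1 for a, b in zip(pitches, pitches[1:]) if abs(b - a) > d)

pitches = [60, 62, 69, 69, 57, 60]
ts = tone_span_count(pitches, d=4)
```

Here only the leaps of 7 and 12 half-steps exceed the threshold, so the count is 2.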
Polyphony Mogren (2016) measures the frequency of two tones playing simultaneously.
Melody Distance Sheng et al. (2021) computed Melody distance by normalizing note pitches (subtracting the mean) and comparing generated and ground-truth pitch time series of varying lengths using dynamic time warping.
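The two steps described here, mean normalization followed by dynamic time warping, can be sketched as below. The absolute difference as the local cost and the use of one value per note (rather than a sampled pitch time series) are simplifying assumptions for illustration.

```python
def normalize(pitches):
    """Subtract the mean pitch so that transposition does not affect the distance."""
    mean_pitch = sum(pitches) / len(pitches)
    return [p - mean_pitch for p in pitches]

def dtw_distance(a, b):
    """Classic dynamic time warping with absolute difference as local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def melody_distance(generated, reference):
    return dtw_distance(normalize(generated), normalize(reference))

same_shape = melody_distance([60, 62, 64], [72, 74, 76])
different = melody_distance([60, 62, 64], [60, 62, 70])
```

Because of the normalization, a melody and its transposition an octave higher have zero distance, while a change in contour yields a positive distance.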
Information Rate (IR) Lattner et al. (2018) is calculated as the mutual information between present and past observations, where high values indicate structured self-similarity in the generated music. The IR metric is estimated using a first-order Markov Chain, contrasting prior entropy with conditional entropy, making it suitable for assessing the repetition structure of musical sequences.
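Under the first-order Markov assumption, the contrast between prior and conditional entropy can be written as $IR = H(x_n) - H(x_n \mid x_{n-1})$. The sketch below estimates both terms from empirical counts over a discrete symbol sequence; the plug-in (count-based) estimator and the use of raw pitch symbols are assumptions made for this example.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of an empirical count distribution."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def information_rate(seq):
    """Estimate H(x_n) - H(x_n | x_{n-1}) from consecutive symbol pairs."""
    if len(seq) < 2:
        return 0.0
    pairs = Counter(zip(seq, seq[1:]))
    previous = Counter(seq[:-1])
    current = Counter(seq[1:])
    # H(current | previous) = H(previous, current) - H(previous)
    conditional = entropy(pairs) - entropy(previous)
    return entropy(current) - conditional

ir_alternating = information_rate([60, 62] * 8)
ir_constant = information_rate([60] * 10)
```

A strictly alternating sequence is fully predictable from the previous symbol, so its IR is close to the marginal entropy of about one bit, whereas a constant sequence carries no information at all and has an IR of zero.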
Rhythmic Consistency Huang and Yang (2020) measured the Rhythmic Consistency of their generated Pop music compositions by generating 1,000 sequences and analyzing their beats and downbeats using an RNN-DBN model.
**Chord Coverage** Yeh et al. (2021) counts how many different chord types appear in a chord sequence by checking non-zero values in the chord histogram. It helps assess whether the model is generating a wide variety of chords or sticking to a limited set.
Chord Tonal Distance (CTD) Yeh et al. (2021) measures the average tonal distance Harte et al. (2006) between each pair of adjacent chords in a sequence. A higher CTD means there are more abrupt changes in the chord progression.
Chord Tone to Non-Chord Tone Ratio (CTnCTR) Yeh et al. (2021) is the ratio of notes that match the underlying chord (chord tones) to those that don’t (non-chord tones). A higher CTnCTR indicates that most notes fit well with the chords.
Pitch consonance score (PCS) Yeh et al. (2021) measures how well melody notes fit with the chords. The average consonance score across 16th-note windows is calculated by checking the musical interval between the melody note and the chord notes.
Extending the idea of tonal distance, Melody-chord tonal distance (MCTD) Yeh et al. (2021) measures the average tonal distance (each distance weighted by the duration of the respective melody note) between each melody note and its corresponding chord label throughout a melody sequence. Together, Chord Coverage (CC), CTD, CTnCTR, PCS, and MCTD help determine how smooth or abrupt the chord changes in a sequence are and how well the whole piece harmonizes.
Alignment accuracy Sheng et al. (2021) measures if the generated melody is accurately aligned with the lyrics by comparing the number of generated tokens with the ground truth.
Variant Proportion (VPi) Wang et al. (2024) calculates the proportion of the i-th type of variant to assess whether the distribution of variant types is reasonable.
Variant Distance (VD) Wang et al. (2024) calculates the average length (in beats) to assess whether the model generates variants correctly.
Similarity Error Yu et al. (2022) evaluates pitch and rhythm by creating note sets per bar (including pitch, duration, and onset), then computing mean intersection-over-union (IoU) similarity across bar pairs. The final score is the difference in mean IoUs between original and generated pieces.
Melody Matchness Yu et al. (2022) calculated Melody Matchness in REMI format by finding the bar wise longest common subsequence between the ground truth and generated piano melodies. Two notes are considered a match if they have the same pitch and their onset times are within an eighth note of each other.
**Pitch Class Histogram Entropy** Wu and Yang (2020) To calculate pitch class histogram entropy, we can create a 12-dimensional pitch class histogram with the notes that appear in a certain period of the music score and calculate the entropy of that histogram:

$$H = -\sum_{i=1}^{12} p_i \log_2 p_i$$

where $H$ is the Pitch Class Entropy and $p_i$ is the probability of the i-th pitch class (C, C#, D, …, B) occurring in a piece. Low entropy indicates clear tonality with dominant pitch classes, while high entropy suggests unstable, scattered tonality. Chord Histogram Entropy Yeh et al. (2021) applies the same idea to chords.
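A minimal sketch of the pitch class entropy computation, assuming notes are given as MIDI pitch numbers and that the entropy is taken in bits (base-2 logarithm):

```python
from collections import Counter
from math import log2

def pitch_class_entropy(pitches):
    """Shannon entropy (bits) of the 12-bin pitch class histogram of `pitches`."""
    counts = Counter(p % 12 for p in pitches)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

single_class = pitch_class_entropy([48, 60, 72])         # only the pitch class C
chromatic = pitch_class_entropy(list(range(60, 72)))     # all 12 classes once
```

The two examples mark the extremes: a passage that uses one pitch class has zero entropy, while a uniform chromatic passage reaches the maximum of $\log_2 12 \approx 3.58$ bits.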
Pitch and Duration Distribution Similarity Sheng et al. (2021) measures how similar the pitch and duration distributions of the generated music are to those of the ground truth. First, pitch and duration frequency histograms are computed, and the similarity is then measured by the average overlapped area between the two histograms.
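The overlapped area between two frequency histograms can be sketched as the sum of bin-wise minima of the normalized histograms. Treating each distinct pitch (or duration) value as one bin, and computing the overlap for a single pair of pieces rather than averaging over a test set, are assumptions of this example.

```python
from collections import Counter

def overlap_area(values_a, values_b):
    """Overlapped area in [0, 1] between two normalized frequency histograms."""
    hist_a, hist_b = Counter(values_a), Counter(values_b)
    n_a, n_b = len(values_a), len(values_b)
    bins = hist_a.keys() | hist_b.keys()
    # Counter returns 0 for a missing bin, so absent values contribute no overlap.
    return sum(min(hist_a[k] / n_a, hist_b[k] / n_b) for k in bins)

pitch_overlap = overlap_area([60, 60, 62, 64], [60, 62, 62, 64])
identical = overlap_area([1, 2, 2], [2, 1, 2])
disjoint = overlap_area([60, 62], [61, 63])
```

The same function applies unchanged to note durations; averaging the per-piece values would give a dataset-level score.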
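Chroma similarity Wang et al. (2024) For symbolic music, particularly in REMI representation, Chroma similarity, or $\mathrm{sim}_{chr}$, measures the closeness of two bars of the generated and reference scores in tone via:

$$\mathrm{sim}_{chr}(r^a, r^b) = \frac{\langle r^a, r^b \rangle}{\lVert r^a \rVert \, \lVert r^b \rVert}$$

where $\langle \cdot, \cdot \rangle$ denotes dot-product and $r \in \mathbb{Z}^{12}$ is the chroma vector representing the number of onsets for each of the 12 pitch classes.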
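A sketch of a bar-level chroma comparison, assuming the similarity is the cosine of the angle between two 12-dimensional onset-count vectors and that each bar is given as a list of MIDI pitches at note onsets. Any additional scaling factor used in the original work is not reproduced here.

```python
from math import sqrt

def chroma_vector(onset_pitches):
    """Count note onsets per pitch class (12 bins)."""
    chroma = [0] * 12
    for pitch in onset_pitches:
        chroma[pitch % 12] += 1
    return chroma

def chroma_similarity(bar_a, bar_b):
    """Cosine similarity between the chroma vectors of two bars."""
    r_a, r_b = chroma_vector(bar_a), chroma_vector(bar_b)
    dot = sum(x * y for x, y in zip(r_a, r_b))
    norm_a = sqrt(sum(x * x for x in r_a))
    norm_b = sqrt(sum(y * y for y in r_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bar has no defined direction; treated as no similarity
    return dot / (norm_a * norm_b)

octave_shifted = chroma_similarity([60, 64, 67], [72, 76, 79])
unrelated = chroma_similarity([60, 64, 67], [61, 66, 68])
```

Because the vector only records pitch classes, a bar and its octave transposition are treated as identical in tone.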
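Macro Overlapped Area (MOA) von Rütte et al. (2022) Let $x$ and $y$ denote two musical sequences and let $x_i$ and $y_i$ denote their i-th bars. Feature overlap is computed using the Gaussian distributions of a chosen feature, with the overlap for bar $i$ given by $\mathrm{overlap}(x_i, y_i)$. Then the macro OA (MOA) between $x$ and $y$ is:

$$\mathrm{MOA}(x, y) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{overlap}(x_i, y_i)$$

where $N$ is the number of bars.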
Chord matchness Yu et al. (2022) measured Chord matchness of the generated piano segment and the target chord in the lead sheet by computing the cosine similarity between their respective chroma vectors.
Average Sample-wise Accuracy (ASA) Lu et al. (2023) is computed by first measuring the proportion of correctly predicted attributes for each sample, then averaging these values across the entire test set.
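The two-stage averaging in ASA can be sketched as follows. Representing each sample as a dictionary from attribute name to value, and the attribute names themselves, are assumptions made for illustration.

```python
def average_samplewise_accuracy(predicted, target):
    """Mean over samples of the fraction of target attributes predicted correctly."""
    per_sample = []
    for pred, tgt in zip(predicted, target):
        correct = sum(1 for name, value in tgt.items() if pred.get(name) == value)
        per_sample.append(correct / len(tgt))
    return sum(per_sample) / len(per_sample)

# Hypothetical attribute names and values.
target = [{"tempo": "fast", "key": "major"}, {"tempo": "slow", "key": "minor"}]
predicted = [{"tempo": "fast", "key": "minor"}, {"tempo": "slow", "key": "minor"}]
asa = average_samplewise_accuracy(predicted, target)
```

The first sample gets half of its attributes right and the second gets all of them, giving an ASA of 0.75.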
Dynamics correlation Wu et al. (2024) measures how well a generated audio score matches the dynamic variations (smoothed frame wise loudness) of a reference performance by calculating Pearson’s correlation.
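A minimal sketch of this comparison, assuming frame-wise loudness values are already available as equal-length lists and that smoothing is a simple moving average (the smoothing method and the window size are assumptions, not taken from the cited work):

```python
from math import sqrt

def moving_average(values, window):
    """Smooth a curve with a sliding mean; output has len(values) - window + 1 points."""
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant curve")
    return cov / (std_x * std_y)

def dynamics_correlation(generated_loudness, reference_loudness, window=3):
    return pearson(moving_average(generated_loudness, window),
                   moving_average(reference_loudness, window))

reference = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
louder_copy = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
corr_same_shape = dynamics_correlation(louder_copy, reference)
corr_reversed = dynamics_correlation(reference[::-1], reference)
```

Since Pearson's correlation ignores scale and offset, a louder copy with the same loudness contour still scores close to 1, while a reversed contour scores close to -1.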
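Grooving Pattern Similarity Wu et al. (2023a) between a pair of grooving patterns $g^a$ and $g^b$ is calculated by:

$$\mathrm{GS}(g^a, g^b) = 1 - \frac{1}{Q} \sum_{i=0}^{Q-1} \mathrm{XOR}(g^a_i, g^b_i)$$

where $Q$ is the dimensionality of $g^a$ and $g^b$.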
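As a sketch, assuming a grooving pattern is a binary vector marking the positions in a bar where a note onset occurs, and that the similarity is one minus the normalized XOR (Hamming) distance between two such vectors:

```python
def grooving_similarity(pattern_a, pattern_b):
    """1 minus the fraction of positions where two binary onset patterns differ."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must have the same dimensionality")
    mismatches = sum(a ^ b for a, b in zip(pattern_a, pattern_b))
    return 1 - mismatches / len(pattern_a)

gs = grooving_similarity([1, 0, 0, 1], [1, 0, 1, 1])
gs_identical = grooving_similarity([1, 0, 1, 0], [1, 0, 1, 0])
```

The two example patterns differ in one of four positions, giving a similarity of 0.75.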
Structureness Indicators Wu and Yang (2020) quantify musical repetition by analyzing a fitness scape plot, a matrix $S$ where each entry $s_{ij}$ reflects the degree of repetition for a segment of duration $i$ centered at time $j$. To capture the most prominent structural repetition within a specific time range $[l, u]$, the indicator is defined as $SI_l^u(S) = \max_{l \le i \le u,\ 1 \le j \le N} s_{ij}$, where $N$ is the number of time positions.
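Given a precomputed fitness scape plot, the indicator reduces to a maximum over a band of rows. The sketch below assumes the plot is a list of rows indexed by segment duration (0-based, in arbitrary time units) and that the duration range is inclusive; computing the scape plot itself is outside the scope of this example.

```python
def structureness_indicator(scape_plot, lower, upper):
    """Largest fitness value among segments whose duration index lies in [lower, upper]."""
    rows = scape_plot[lower:upper + 1]
    if not rows:
        raise ValueError("duration range selects no rows")
    return max(max(row) for row in rows)

# Toy scape plot: rows are durations, columns are segment centers.
scape_plot = [
    [0.1, 0.2, 0.1],
    [0.3, 0.7, 0.2],
    [0.4, 0.5, 0.9],
]
short_term = structureness_indicator(scape_plot, 0, 1)
long_term = structureness_indicator(scape_plot, 2, 2)
```

Choosing different duration ranges gives short-, medium-, or long-term variants of the indicator from the same matrix.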
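Chord Accuracy Ren et al. (2020) checks if the conditional chord sequence matches the chords of the generated score by calculating:

$$\mathrm{CA} = \frac{1}{N_{tracks} \cdot N_{chords}} \sum_{i=1}^{N_{tracks}} \sum_{j=1}^{N_{chords}} \mathbb{1}\{C_{i,j} = \hat{C}_{i,j}\}$$

where $N_{tracks}$ and $N_{chords}$ are the number of tracks and chords per track respectively, and $C_{i,j}$ and $\hat{C}_{i,j}$ denote the $j$-th conditional chord and the corresponding chord of the generated score in the $i$-th track.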
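A sketch of the chord accuracy computation, assuming chords have already been recognized from each generated track and are given as lists of chord labels aligned position by position with the conditional chord sequence (chord recognition itself is not shown):

```python
def chord_accuracy(conditional, generated):
    """Fraction of (track, chord position) pairs where the generated chord matches."""
    total = 0
    correct = 0
    for cond_track, gen_track in zip(conditional, generated):
        for cond_chord, gen_chord in zip(cond_track, gen_track):
            total += 1
            correct += int(cond_chord == gen_chord)
    if total == 0:
        raise ValueError("no chords to compare")
    return correct / total

# Two tracks with four chord positions each; labels are illustrative.
conditional = [["C", "G", "Am", "F"], ["C", "G", "Am", "F"]]
generated = [["C", "G", "Am", "F"], ["C", "Em", "Am", "Dm"]]
ca = chord_accuracy(conditional, generated)
```

Six of the eight chord positions match in this example, giving a chord accuracy of 0.75.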
Appendix C Evaluation Toolkits
Several open-source toolkits are available to facilitate evaluation. For symbolic music, MGEval Yang and Lerch (2020), MusPy Dong et al. (2020), Music21 Cuthbert and Ariza (2010), and JSymbolic McKay et al. (2018) provide feature extraction, dataset management, and visualization tools. They support analyzing different features for both absolute and comparative evaluation. For audio music, available tools include the FAD toolkit (https://github.com/microsoft/fadtk), Stability AI’s code (https://github.com/Stability-AI/stable-audio-metrics) for calculating the metrics of Evans et al. (2024), and Meta’s Audiobox Aesthetics (https://github.com/facebookresearch/audiobox-aesthetics).
References
- Agostinelli et al. (2023) Andrea Agostinelli, Timo I Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, et al. 2023. MusicLM: Generating music from text. arXiv preprint arXiv:2301.11325.
- Bai et al. (2025) Yatong Bai, Jonah Casebeer, Somayeh Sojoudi, and Nicholas J Bryan. 2025. DRAGON: Distributional rewards optimize diffusion generative models. arXiv preprint arXiv:2504.15217.
- Brunner et al. (2018a) Gino Brunner, Andres Konrad, Yuyi Wang, and Roger Wattenhofer. 2018a. MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. arXiv preprint arXiv:1809.07600.
- Brunner et al. (2018b) Gino Brunner, Yuyi Wang, Roger Wattenhofer, and Sumu Zhao. 2018b. Symbolic music genre transfer with CycleGAN. In 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pages 786–793. IEEE.
- Chang et al. (2024) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45.
- Chen et al. (2024) Haonan Chen, Jordan BL Smith, Janne Spijkervet, Ju-Chiang Wang, Pei Zou, Bochen Li, Qiuqiang Kong, and Xingjian Du. 2024. SymPAC: Scalable symbolic music generation with prompts and constraints. arXiv preprint arXiv:2409.03055.
- Chen et al. (2019) Ke Chen, Weilin Zhang, Shlomo Dubnov, Gus Xia, and Wei Li. 2019. The effect of explicit structure encoding of deep neural networks for symbolic music generation. In 2019 International Workshop on Multilayer Music Representation and Processing (MMRP), pages 77–84. IEEE.
- Choi et al. (2020) Kristy Choi, Curtis Hawthorne, Ian Simon, Monica Dinculescu, and Jesse Engel. 2020. Encoding musical style with transformer autoencoders. In International Conference on Machine Learning, pages 1899–1908. PMLR.
