Factual and Musical Evaluation Metrics for Music Language Models

Daniel Chenyu Lin; Michael Freeman; John Thickstun

arXiv:2511.05550·cs.SD·November 11, 2025

Open Access · 3 Reviews

TL;DR

This paper shows that existing evaluation metrics for music language models capture linguistic fluency rather than correctness, and proposes a music-adapted metric and a factual evaluation framework to measure whether model answers are correct.

Contribution

It introduces a domain-adapted evaluation metric and a factual correctness framework for Music LMs, addressing limitations of traditional linguistic metrics.

Findings

01

Existing metrics such as BLEU, METEOR, and BERTScore measure linguistic fluency, not factual correctness.

02

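The proposed factual evaluation framework is agnostic to the modality of the question-answering model and could be generalized to other open-ended question-answering domains.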

03

Experiments on open datasets demonstrate the effectiveness of the new metrics.

Abstract

Music language models (Music LMs), like vision language models, leverage multimodal representations to answer natural language queries about musical audio recordings. Although Music LMs are reportedly improving, we find that current evaluations fail to capture whether their answers are correct. Specifically, for all Music LMs that we examine, widely-used evaluation metrics such as BLEU, METEOR, and BERTScore fail to measure anything beyond linguistic fluency of the model's responses. To measure the true performance of Music LMs, we propose (1) a better general-purpose evaluation metric for Music LMs adapted to the music domain and (2) a factual evaluation framework to quantify the correctness of a Music LM's responses. Our framework is agnostic to the modality of the question-answering model and could be generalized to quantify performance in other open-ended question-answering domains.…
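The abstract's central claim, that overlap-based metrics can reward fluency over correctness, can be illustrated with a toy example. The sketch below is not the paper's method or any of the metrics it names: `unigram_f1` is a hypothetical, simplified stand-in for token-overlap metrics, and the three sentences are invented for illustration. Under those assumptions, a fluent answer that gets the instrument and tempo wrong outscores a correct paraphrase.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings.

    Hypothetical stand-in for overlap-based metrics; it is not BLEU,
    METEOR, or BERTScore, only a simplified illustration.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented example sentences (assumptions, not data from the paper).
reference = "the track features a solo piano playing a slow melody"
correct = "a solo piano plays a slow melody in this track"
wrong = "the track features a solo violin playing a fast melody"

score_correct = unigram_f1(correct, reference)
score_wrong = unigram_f1(wrong, reference)

print(f"correct paraphrase: {score_correct:.2f}")  # 0.70
print(f"fluent but wrong:   {score_wrong:.2f}")    # 0.80
```

The wrong answer scores higher because it copies the reference's wording while changing only the two facts that matter, which is the kind of failure a factual evaluation is meant to expose.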

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- Clear problem framing. Highlights a real gap: NLP metrics reward fluency, not audio-grounded correctness.
- Simple, reusable tools. CLAPText is a practical drop-in; the factual QA pipeline is modality-agnostic.
- Baseline design is insightful. Random-audio and adversarial paraphrase conditions stress-test audio use vs. surface text overlap.
- Open datasets/code intent. Encourages reproducibility and broader adoption.

Weaknesses

- Questionable data quality for MusicQA. A large portion of prompts/answers stems from MPT-7B generation; prior work reports hallucinations and low musician approval (e.g., MusiLingo), weakening conclusions about “correct” vs. “random” gaps when “correct” itself is noisy.
- Model coverage is thin. Aside from SALMONN, evaluated models are not the strongest current baselines; conclusions about metric validity would be more convincing with Qwen-Audio, Qwen2.5-Omni, ChatGPT-5, Gemini 2.5 Pro, etc.
- Ad

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The authors' motivation is strong, as the current state of evaluation for Music LMs is rather dubious.
- The construction of the factuality evaluation protocol in particular is reasonably strong. Such a framework gets at the heart of the failure modes of many modern Music LMs.

Weaknesses

- It feels as if this paper is caught between two similar yet practically orthogonal contributions: CLAPText and the factual evaluation protocol. This muddies the overall flow and contribution of the paper, as substantially more content is dedicated to the weaker results (CLAPText) than to the correctness evaluation.
- Overall, the evaluation with CLAPText is not massively convincing. First off, it is unclear why the authors opted for random sampling of audio prompts rather than Gaussian noise (a

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- This paper provides a valuable analysis of the music domain, addressing the limitations of commonly used metrics such as BLEU in effectively evaluating models in this context.
- The proposal to use CLAPText as an evaluation metric is a reasonable and intuitive approach. Furthermore, the experimental results with adversarial text offer a suggestive and insightful contribution to the field.

Weaknesses

- The highest-scoring case (paraphrase) and the lowest-scoring cases (adversarial/random) have been evaluated. However, as a validation of the evaluation metric, it is also necessary to confirm whether diverse cases can be appropriately ranked in order. For example, when there are partial differences (e.g., only some instruments are incorrect, or the information is partially correct but includes additional new incorrect information), the score should change gradually to reflect these differences

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Music and Audio Processing