The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?
Pedro Ramoneda, Emilia Parada-Cabaleiro, Benno Weck, Xavier Serra

TL;DR
This paper evaluates the reliability of Large Language Models in musicology, proposing a semi-automatic benchmarking method and highlighting the need for domain-specific models to improve trustworthiness.
Contribution
It introduces a semi-automatic benchmarking approach for LLMs in musicology and emphasizes the importance of specialized models with accurate domain knowledge.
Findings
Vanilla LLMs are less reliable than retrieval-augmented generation from music dictionaries.
A benchmark of 400 human-validated multiple-choice questions exposes the limitations of current LLMs.
Domain-specific LLMs could enhance reliability in musicology.
Abstract
In this work, we explore the use and reliability of Large Language Models (LLMs) in musicology. From a discussion with experts and students, we assess the current acceptance of, and concerns regarding, this now-ubiquitous technology. We aim to go one step further, proposing a semi-automatic method to create an initial benchmark using retrieval-augmented generation models and multiple-choice question generation, validated by human experts. Our evaluation on 400 human-validated questions shows that current vanilla LLMs are less reliable than retrieval-augmented generation from music dictionaries. This paper suggests that realizing the potential of LLMs in musicology requires musicology-driven research that can specialize LLMs by including accurate and reliable domain knowledge.
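As a rough illustration of how a multiple-choice benchmark of this kind might be scored, the sketch below computes accuracy over only the questions that passed human validation. It is not the authors' implementation: the `MCQ` record, the `accuracy` function, the `always_first` baseline, and the three toy questions are all hypothetical, and a real evaluation would replace the baseline with a call to an actual model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MCQ:
    """Hypothetical record for one multiple-choice question."""
    question: str
    options: List[str]
    answer: int       # index of the correct option
    validated: bool   # True if a human expert approved the question


def accuracy(questions: List[MCQ], predict: Callable[[MCQ], int]) -> float:
    """Fraction of human-validated questions that `predict` answers correctly."""
    kept = [q for q in questions if q.validated]
    if not kept:
        raise ValueError("no validated questions to score")
    correct = sum(1 for q in kept if predict(q) == q.answer)
    return correct / len(kept)


# Toy data for illustration only; these are not questions from the paper.
TOY = [
    MCQ("Which term denotes a gradual increase in loudness?",
        ["crescendo", "ritardando", "staccato", "fermata"], 0, True),
    MCQ("How many lines does a standard modern staff have?",
        ["four", "five", "six", "seven"], 1, True),
    MCQ("Placeholder question rejected during validation.",
        ["a", "b", "c", "d"], 2, False),
]


def always_first(q: MCQ) -> int:
    """Trivial stand-in for a model: always picks the first option."""
    return 0


if __name__ == "__main__":
    print(accuracy(TOY, always_first))
```

Filtering on `validated` before scoring mirrors the abstract's emphasis on human-validated questions: automatically generated items that experts rejected never affect the reported accuracy.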
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies
