Exploring the Feasibility of LLMs for Automated Music Emotion Annotation
Meng Yang, Jon McCormack, Maria Teresa Llano, Wanchao Su

TL;DR
This paper investigates using GPT-4o, a large language model, for automated music emotion annotation, comparing its performance to human experts and assessing its reliability and potential as a scalable alternative.
Contribution
The paper demonstrates the feasibility of employing GPT-4o for music emotion annotation and provides a comprehensive evaluation of its performance relative to human experts.
Findings
GPT-4o's annotations are less nuanced than those of human experts.
The variability of GPT-4o's annotations is comparable to the level of disagreement among human experts.
GPT-4o offers a cost-effective, scalable annotation method.
Abstract
Current approaches to music emotion annotation remain heavily reliant on manual labelling, a process that imposes significant resource and labour burdens and severely limits the scale of available annotated data. This study examines the feasibility and reliability of employing a large language model (GPT-4o) for music emotion annotation. We annotated GiantMIDI-Piano, a classical MIDI piano music dataset, in a four-quadrant valence-arousal framework using GPT-4o, and compared the results against annotations provided by three human experts. We conducted extensive evaluations to assess the performance and reliability of GPT-generated music emotion annotations, including standard accuracy, weighted accuracy that accounts for inter-expert agreement, inter-annotator agreement metrics, and distributional similarity of the generated labels. While GPT's annotation performance fell short of…
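The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give their exact definitions. The sketch assumes that standard accuracy compares the model label with the majority vote of the three experts, that weighted accuracy credits each model label with the fraction of experts who chose it, and that inter-annotator agreement is measured with Cohen's kappa. The quadrant names Q1 to Q4, the function names, and the toy labels are all hypothetical.

```python
from collections import Counter


def majority_label(expert_labels):
    """Most common expert label; ties go to the label seen first."""
    return Counter(expert_labels).most_common(1)[0][0]


def standard_accuracy(model_labels, expert_labels):
    """Share of pieces where the model matches the expert majority vote."""
    hits = sum(m == majority_label(e) for m, e in zip(model_labels, expert_labels))
    return hits / len(model_labels)


def weighted_accuracy(model_labels, expert_labels):
    """Assumed definition: each model label earns partial credit equal to
    the fraction of experts who picked the same quadrant."""
    credit = sum(e.count(m) / len(e) for m, e in zip(model_labels, expert_labels))
    return credit / len(model_labels)


def cohen_kappa(a, b):
    """Chance-corrected agreement between two annotators."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[q] * count_b[q] for q in set(a) | set(b)) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data, made up for illustration: one model label and three expert
# labels per piece, each a valence-arousal quadrant.
model = ["Q1", "Q2", "Q3", "Q4"]
experts = [
    ("Q1", "Q1", "Q2"),
    ("Q2", "Q2", "Q2"),
    ("Q4", "Q3", "Q4"),
    ("Q4", "Q4", "Q1"),
]

majority = [majority_label(e) for e in experts]
print("standard accuracy:", standard_accuracy(model, experts))
print("weighted accuracy:", weighted_accuracy(model, experts))
print("kappa vs. majority:", cohen_kappa(model, majority))
```

Under these assumptions, weighted accuracy penalises the model less when the experts themselves are split, which is one plausible reading of "accounts for inter-expert agreement".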
Taxonomy
Topics: Music and Audio Processing
