The Interpretation Gap in Text-to-Music Generation Models
Yongyi Zang, Yixiao Zhang

TL;DR
This paper argues that the main gap between text-to-music models and musicians lies in the models' limited ability to interpret musicians' controls, and it proposes a framework and two strategies to improve human-AI musical collaboration.
Contribution
It introduces a framework for musical interaction and highlights the interpretation stage as a critical gap, suggesting new strategies to improve model-human collaboration.
Findings
Highlights the interpretation gap in current models
Proposes a framework for musical interaction stages
Calls for research on interpretation in music AI
Abstract
Large-scale text-to-music generation models have significantly enhanced music creation capabilities, offering unprecedented creative freedom. However, their ability to collaborate effectively with human musicians remains limited. In this paper, we propose a framework to describe the musical interaction process, which includes expression, interpretation, and execution of controls. Following this framework, we argue that the primary gap between existing text-to-music models and musicians lies in the interpretation stage, where models lack the ability to interpret controls from musicians. We also propose two strategies to address this gap and call on the music information retrieval community to tackle the interpretation challenge to improve human-AI musical collaboration.
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Natural Language Processing Techniques
