The Interpretation Gap in Text-to-Music Generation Models

Yongyi Zang; Yixiao Zhang

arXiv:2407.10328 · cs.SD · July 16, 2024



TL;DR

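This paper argues that the main gap between text-to-music models and musicians lies in the models' inability to interpret musicians' controls. It proposes a framework that describes musical interaction as the expression, interpretation, and execution of controls, along with two strategies to improve human-AI musical collaboration.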

Contribution

The paper introduces a framework for the musical interaction process (expression, interpretation, and execution of controls), identifies the interpretation stage as the primary gap between existing models and musicians, and proposes two strategies to address it.

Findings

01 Highlights the interpretation gap in current text-to-music models

02 Proposes a framework for the stages of musical interaction

03 Calls for research on interpretation in music AI

Abstract

Large-scale text-to-music generation models have significantly enhanced music creation capabilities, offering unprecedented creative freedom. However, their ability to collaborate effectively with human musicians remains limited. In this paper, we propose a framework to describe the musical interaction process, which includes expression, interpretation, and execution of controls. Following this framework, we argue that the primary gap between existing text-to-music models and musicians lies in the interpretation stage, where models lack the ability to interpret controls from musicians. We also propose two strategies to address this gap and call on the music information retrieval community to tackle the interpretation challenge to improve human-AI musical collaboration.


Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Natural Language Processing Techniques