MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

Zihao Deng; Yinghao Ma; Yudong Liu; Rongchen Guo; Ge Zhang; Wenhu Chen; Wenhao Huang; Emmanouil Benetos

arXiv:2309.08730·eess.AS·April 3, 2024

Open Access · 1 Repo · 4 Models · 1 Dataset

TL;DR

MusiLingo is a system that connects music audio and text using pre-trained language models, enabling effective music captioning and query responses by aligning music representations with textual contexts.

Contribution

It introduces MusiLingo, a novel approach that aligns music audio representations with language models, and creates the MusicInstruct dataset for music-related question answering.

Findings

01 Competitive performance in music captioning

02 Effective music-related query response generation

03 Creation of the MusicInstruct dataset

Abstract

Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains underexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained frozen music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs. Our…
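
The alignment idea in the abstract (one trainable projection between a frozen audio encoder and a frozen LLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the official repository uses PyTorch, while the code below uses plain Python lists so that it runs without dependencies. The function names (`make_projection`, `project`, `build_llm_input`), the toy dimensions, and the choice to prepend music tokens to the text tokens are all assumptions made for illustration.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Create weights for a single linear layer (hypothetical helper).

    In the setup described in the abstract, this layer would be the only
    trainable part; the audio encoder and the LLM stay frozen.
    """
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def project(frames, weight, bias):
    """Map each music feature frame into the LLM embedding space: y = Wx + b."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b
         for row, b in zip(weight, bias)]
        for frame in frames
    ]

def build_llm_input(music_frames, text_embeddings, weight, bias):
    """Prepend projected music tokens to the text token embeddings.

    The ordering (music first, then text) is an assumption of this sketch.
    """
    return project(music_frames, weight, bias) + text_embeddings

# Toy sizes chosen for readability; real encoders and LLMs are far larger.
MUSIC_DIM, LLM_DIM = 4, 6
weight, bias = make_projection(MUSIC_DIM, LLM_DIM)

# Stand-ins for frozen-encoder output (3 frames) and embedded prompt (2 tokens).
music_frames = [[0.1 * (i + j) for j in range(MUSIC_DIM)] for i in range(3)]
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]

sequence = build_llm_input(music_frames, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 5 6
```

Because only `weight` and `bias` would receive gradient updates in such a design, the number of trained parameters stays small relative to the two frozen models.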

Code & Models

Repositories

zihaod/musilingo
PyTorch · Official

Models

Datasets

m-a-p/Music-Instruct
dataset · 47 dl

Taxonomy

Topics: Music and Audio Processing · Diverse Musicological Studies · Speech Recognition and Synthesis

Methods: ALIGN