DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning

Zhuoyuan Mao; Mengjie Zhao; Qiyu Wu; Hiromi Wakaki; Yuki Mitsufuji

arXiv:2502.12623·cs.SD·September 24, 2025

TL;DR

DeepResonance is a multimodal music understanding model that integrates music, text, image, and video inputs through multi-way instruction tuning and achieves state-of-the-art results on six music understanding tasks.

Contribution

The paper introduces DeepResonance, a novel multimodal music understanding LLM with multi-way instruction tuning and new datasets, enhancing integration of visual and textual music features.

Findings

01 Achieves state-of-the-art performance on six music understanding tasks.

02 Effectively integrates visual and textual modalities for improved understanding.

03 Demonstrates the benefits of auxiliary modalities in music comprehension.

Abstract

Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual and textual music feature content. We also introduce multi-sampled ImageBind embeddings…

Code & Models

Datasets

Sony/DeepResonance_data_models
dataset· 110 dl

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax