Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms

Heehwan Wang; Joonwoo Kwon; Sooyoung Kim; Jungwoo Seo; Shinjae Yoo; Yuewei Lin; Jiook Cha

arXiv:2411.15913 · cs.SD · May 14, 2026

TL;DR

Stylus is a training-free framework that adapts pretrained image diffusion models for high-fidelity music style transfer on Mel-spectrograms, outperforming existing methods in content preservation and perceptual quality.

Contribution

Stylus repurposes pretrained image diffusion models for music style transfer without additional training, combining style key-value injection in self-attention with phase-preserving reconstruction.
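This page does not spell out how the phase-preserving reconstruction works. One plausible reading, offered only as an assumption, is that the stylized magnitude is recombined with the phase of the source audio instead of estimating phase from scratch. The sketch below illustrates that idea with a deliberately simplified, non-overlapping framed FFT in NumPy. The helper names (`frame_stft`, `frame_istft`, `reconstruct_with_source_phase`) are hypothetical, the Mel-to-linear mapping is omitted, and the "stylized" magnitude is a stand-in rather than a model output.

```python
import numpy as np

def frame_stft(signal, n_fft):
    """Toy STFT: split into non-overlapping frames and take a real FFT of each."""
    n_frames = len(signal) // n_fft
    frames = signal[: n_frames * n_fft].reshape(n_frames, n_fft)
    return np.fft.rfft(frames, axis=1)

def frame_istft(spec, n_fft):
    """Inverse of frame_stft: inverse FFT per frame, then concatenate."""
    return np.fft.irfft(spec, n=n_fft, axis=1).reshape(-1)

def reconstruct_with_source_phase(stylized_mag, source_spec, n_fft):
    """Assumed strategy: pair a new magnitude with the source signal's phase."""
    phase = np.angle(source_spec)
    return frame_istft(stylized_mag * np.exp(1j * phase), n_fft)

# Hypothetical source signal: one second of a 440 Hz tone at 8 kHz.
sr, n_fft = 8000, 256
t = np.arange(sr) / sr
source = np.sin(2 * np.pi * 440 * t)
source_spec = frame_stft(source, n_fft)

# Stand-in for a stylized magnitude (here just a scaled copy of the source).
stylized_mag = 0.5 * np.abs(source_spec)
audio = reconstruct_with_source_phase(stylized_mag, source_spec, n_fft)
```

Because the stand-in magnitude is only a scaled copy of the source, the output is the source at half amplitude. With a genuinely different magnitude, reusing the source phase avoids iterative phase estimation, at the cost of a possible mismatch between magnitude and phase.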

Findings

01. Outperforms state-of-the-art baselines with 34.1% higher content preservation.

02. Achieves 25.7% better perceptual quality in evaluations.

03. Demonstrates effective use of generic image priors for audio transformation.

Abstract

Music style transfer blends source structure with reference style to enable personalized music creation. However, existing zero-shot methods often struggle to capture fine-grained audio nuances, relying on coarse text descriptions or requiring expensive task-specific training. We propose Stylus, a training-free framework that repurposes pretrained image diffusion models for music style transfer in the Mel-spectrogram domain. By treating audio as structured time-frequency images, Stylus manipulates self-attention by injecting style keys and values while preserving source structural queries. To ensure high fidelity, we introduce a phase-preserving reconstruction strategy to mitigate spectrogram inversion artifacts, alongside a classifier-free-guidance-inspired control for adjustable stylization. Extensive evaluations including 2,925 human ratings demonstrate that Stylus outperforms…
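The attention manipulation described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. Queries come from the source (content) features, keys and values come from the style reference, and a scalar `guidance` blends the stylized output with the plain self-attention output in the manner of classifier-free guidance. The blending formula, the function names, and the random toy features are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for 2-D (tokens, dim) arrays."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax((q @ k.T) * scale) @ v

def style_injected_attention(q_content, k_content, v_content,
                             k_style, v_style, guidance=1.0):
    """Keep content queries; swap in style keys and values.

    `guidance` is an assumed CFG-style knob: 0.0 returns plain
    self-attention on the content, 1.0 returns the fully stylized output.
    """
    plain = attention(q_content, k_content, v_content)
    stylized = attention(q_content, k_style, v_style)
    return plain + guidance * (stylized - plain)

# Toy features standing in for Mel-spectrogram patch tokens.
rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
q_c, k_c, v_c = (rng.standard_normal((n_tokens, dim)) for _ in range(3))
k_s, v_s = (rng.standard_normal((n_tokens, dim)) for _ in range(2))

out = style_injected_attention(q_c, k_c, v_c, k_s, v_s, guidance=0.5)
```

Keeping the content queries means each output position still attends from the source's structural layout, while the values it aggregates come from the style reference.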

Code & Models

Repositories

Sooyyoungg/Stylus.git (GitHub)