TL;DR
Stylus is a training-free framework that adapts pretrained image diffusion models for high-fidelity music style transfer on Mel-spectrograms, outperforming existing methods in content preservation and perceptual quality.
Contribution
Stylus repurposes pretrained image diffusion models for music style transfer without any additional training, combining style key-value injection in self-attention with phase-preserving reconstruction.
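The summary does not say how the phase-preserving reconstruction works. One plausible reading, sketched here purely as an assumption, is to keep the phase of the source audio's STFT and pair it with the stylized magnitude instead of estimating phase from scratch. The function name `reattach_source_phase` and the premise that the stylized magnitude has already been mapped back to the same linear-frequency grid as the source STFT are hypothetical and not taken from the paper.

```python
import numpy as np


def reattach_source_phase(stylized_mag: np.ndarray,
                          source_stft: np.ndarray) -> np.ndarray:
    """Combine a stylized magnitude with the phase of the source STFT.

    Assumption (not from the paper): both arrays share the same
    (frequency_bins, frames) shape on a linear-frequency grid.
    """
    if stylized_mag.shape != source_stft.shape:
        raise ValueError("magnitude and source STFT must have the same shape")
    # Phase angle of every time-frequency bin of the source signal.
    source_phase = np.angle(source_stft)
    # Magnitudes cannot be negative; clip any small negative residue.
    magnitude = np.clip(stylized_mag, 0.0, None)
    # Complex spectrogram: stylized magnitude, source phase.
    return magnitude * np.exp(1j * source_phase)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.standard_normal((5, 7)) + 1j * rng.standard_normal((5, 7))
    stylized = rng.uniform(0.1, 2.0, size=(5, 7))
    complex_spec = reattach_source_phase(stylized, source)
    print(complex_spec.shape)
```

The resulting complex spectrogram would then go through an inverse STFT from an audio library to produce a waveform; that step, and the Mel-to-linear mapping before it, are left out of this sketch.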
Findings
Outperforms state-of-the-art baselines with 34.1% higher content preservation.
Achieves 25.7% better perceptual quality in evaluations.
Demonstrates effective use of generic image priors for audio transformation.
Abstract
Music style transfer blends source structure with reference style to enable personalized music creation. However, existing zero-shot methods often struggle to capture fine-grained audio nuances, relying on coarse text descriptions or requiring expensive task-specific training. We propose Stylus, a training-free framework that repurposes pretrained image diffusion models for music style transfer in the Mel-spectrogram domain. By treating audio as structured time-frequency images, Stylus manipulates self-attention by injecting style keys and values while preserving source structural queries. To ensure high fidelity, we introduce a phase-preserving reconstruction strategy to mitigate spectrogram inversion artifacts, alongside a classifier-free-guidance-inspired control for adjustable stylization. Extensive evaluations including 2,925 human ratings demonstrate that Stylus outperforms…
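The attention manipulation described in the abstract can be illustrated with a minimal NumPy sketch: queries come from the source spectrogram features, while keys and values are swapped for those of the style reference. The blending rule below, which extrapolates from the plain source attention toward the style-injected attention with a single `guidance` scalar, is only an assumption modeled on classifier-free guidance; the abstract does not give the exact formula. All function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the row maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    # Standard scaled dot-product attention for one head.
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax((q @ k.T) * scale) @ v


def style_injected_attention(q_src, k_src, v_src, k_sty, v_sty, guidance=1.0):
    """Source queries attend to style keys/values (assumed blending rule).

    guidance = 0 returns plain source self-attention,
    guidance = 1 returns fully style-injected attention,
    guidance > 1 extrapolates further toward the style.
    """
    plain = attention(q_src, k_src, v_src)
    styled = attention(q_src, k_sty, v_sty)
    return plain + guidance * (styled - plain)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8
    q_src = rng.standard_normal((6, dim))   # 6 source tokens
    k_src = rng.standard_normal((6, dim))
    v_src = rng.standard_normal((6, dim))
    k_sty = rng.standard_normal((10, dim))  # 10 style tokens
    v_sty = rng.standard_normal((10, dim))
    out = style_injected_attention(q_src, k_src, v_src, k_sty, v_sty,
                                   guidance=1.5)
    print(out.shape)
```

Because the queries are untouched, the output keeps one row per source token, so the time-frequency layout of the source is preserved while the content of each row is drawn from the style features.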
