Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems
Leduo Chen, Junchuan Zhao, Shengchen Li

TL;DR
This paper introduces MixtureTT, a diffusion-based system for multi-instrument timbre transfer directly from polyphonic mixtures, outperforming single-instrument methods and reducing artifacts.
Contribution
It presents what the authors describe as the first joint multi-stem timbre transfer system, one that models dependencies across stems and improves coherence and efficiency over prior separate-then-transfer approaches.
Findings
MixtureTT outperforms single-instrument baselines on objective metrics.
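Joint modeling lowers inference cost relative to per-stem processing, with savings that scale with the number of stems, because all stems share one diffusion process.
The system produces more coherent multi-stem outputs despite the harder input condition of working directly from a polyphonic mixture.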
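One way to read the inference-cost finding is as a count of denoising-network evaluations. The sketch below is illustrative arithmetic only, not a measurement from the paper: the function names, the stem count, and the step count are hypothetical, and it assumes each network call costs roughly the same in both setups while ignoring the cost of the separation stage.

```python
def denoiser_calls_separate(num_stems: int, num_steps: int) -> int:
    """Separate-then-transfer: one full sampling run per stem (assumed)."""
    return num_stems * num_steps


def denoiser_calls_joint(num_steps: int) -> int:
    """Joint transfer: a single shared sampling run covers all stems (assumed)."""
    return num_steps


def call_ratio(num_stems: int, num_steps: int) -> float:
    """How many times more network calls the per-stem pipeline makes."""
    return denoiser_calls_separate(num_stems, num_steps) / denoiser_calls_joint(num_steps)


if __name__ == "__main__":
    # Hypothetical setting: 4 stems, 50 sampling steps.
    print(denoiser_calls_separate(4, 50))
    print(denoiser_calls_joint(50))
    print(call_ratio(4, 50))
```

Under these assumptions the ratio equals the number of stems. A joint model may still cost more per call than a single-stem model, so the real saving could be smaller than this count suggests.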
Abstract
Timbre transfer aims to modify the timbral identity of a musical recording while preserving the original melody and rhythm. While single-instrument timbre transfer has made substantial progress, existing approaches to multi-instrument settings rely on separate-then-transfer pipelines that propagate source separation artifacts and produce incoherent synthesized timbres across stems. This paper proposes MixtureTT, to the best of our knowledge the first system for flexible per-stem timbre transfer directly from a polyphonic mixture. Given a mixture and a separate timbre reference for each target voice, MixtureTT jointly transfers all stems to the specified instruments through a shared diffusion process. By modeling dependencies across per-stem content and cross-stem harmony, the proposed joint stem diffusion transformer eliminates cascaded separation error, reduces inference cost by…
