DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

Yin-Jyun Luo; Kin Wai Cheuk; Woosung Choi; Toshimitsu Uesaka; Keisuke Toyama; Koichi Saito; Chieh-Hsin Lai; Yuhta Takida; Wei-Hsiang Liao; Simon Dixon; Yuki Mitsufuji

arXiv:2408.10807·cs.SD·August 21, 2024

Open Access

TL;DR

DisMix is a novel generative framework that disentangles pitch and timbre in multi-instrument music mixtures, enabling manipulation and synthesis of new instrument combinations and musical attributes.

Contribution

DisMix introduces a modular approach to source-level pitch and timbre disentanglement in multi-instrument music, filling a gap left by existing methods that focus on single-instrument audio.

Findings

01. Successfully disentangles pitch and timbre in multi-instrument mixtures

02. Enables manipulation of instrument attributes and creation of novel instrument combinations

03. Effective on both simple chords and complex chorale datasets

Abstract

Existing work on pitch and timbre disentanglement has mostly focused on single-instrument music audio, excluding cases where multiple instruments are present. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and the collection of these forms a set of per-instrument latent representations underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of pitch and timbre of the constituent instruments. We can jointly learn the disentangled pitch-timbre representations and a latent diffusion transformer that reconstructs the mixture conditioned on the set of source-level representations. We evaluate the model using both a simple dataset of isolated chords and a realistic four-part…
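The "modular building blocks" idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the `SourceLatent` container, the `swap_timbre` helper, the vector sizes, and the use of concatenation to combine pitch and timbre are all assumptions made for illustration. In the actual model the latents are learned and a latent diffusion transformer decodes the mixture from the set of source-level representations.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class SourceLatent:
    """Hypothetical container for one instrument's disentangled latents."""
    pitch: Vector   # stands in for the melody/pitch representation
    timbre: Vector  # stands in for the instrument/timbre representation


def swap_timbre(sources: List[SourceLatent], i: int, j: int) -> List[SourceLatent]:
    """Return a new set in which sources i and j exchange timbre but keep pitch."""
    out = list(sources)  # copy so the original set is left untouched
    out[i] = SourceLatent(pitch=sources[i].pitch, timbre=sources[j].timbre)
    out[j] = SourceLatent(pitch=sources[j].pitch, timbre=sources[i].timbre)
    return out


def conditioning_set(sources: List[SourceLatent]) -> List[Vector]:
    """Build one vector per source (assumed here: plain concatenation).

    In a full system this set would condition the mixture decoder.
    """
    return [s.pitch + s.timbre for s in sources]


# Two toy sources with made-up 2-dimensional latents.
mixture = [
    SourceLatent(pitch=(1.0, 0.0), timbre=(0.2, 0.8)),
    SourceLatent(pitch=(0.0, 1.0), timbre=(0.9, 0.1)),
]

# Novel combination: each source keeps its melody but takes the other's instrument.
edited = swap_timbre(mixture, 0, 1)
```

Because pitch and timbre are held in separate fields, an edit touches one attribute of one source and leaves everything else in the set unchanged, which is the property the abstract relies on for sampling novel mixtures.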

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Computer Graphics and Visualization Techniques

Methods: Sparse Evolutionary Training · Diffusion