MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction

Yunkee Chae; Kyogu Lee

arXiv:2505.23305·cs.SD·October 21, 2025

MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction

Yunkee Chae, Kyogu Lee

PDF

Open Access

TL;DR

MGE-LDM introduces a unified latent diffusion model capable of simultaneous music generation, source imputation, and flexible source separation, trained across diverse datasets without fixed instrument categories.

Contribution

The paper presents a novel joint latent diffusion framework that handles multiple music tasks and sources in a class-agnostic manner, unlike prior fixed-category models.

Findings

01

Supports complete and partial music generation.

02

Enables text-conditioned source extraction.

03

Trained on heterogeneous multi-track datasets.

Abstract

We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories. Audio samples…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMusic and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

MethodsInpainting · Diffusion