Stemphonic: All-at-once Flexible Multi-stem Music Generation

Shih-Lun Wu; Ge Zhu; Juan-Pablo Caceres; Cheng-Zhi Anna Huang; Nicholas J. Bryan

arXiv:2602.09891·cs.SD·February 11, 2026

TL;DR

Stemphonic is a multi-stem music generation framework that produces a variable set of synchronized instrument stems in one inference pass. It offers more flexible stem combinations than fixed-architecture models and faster inference than approaches that generate one stem at a time.

Contribution

The paper introduces a diffusion-/flow-based approach that generates variable sets of synchronized stems simultaneously, enabling flexible, high-quality, and efficient multi-stem music synthesis.

Findings

01. Produces higher-quality stems than existing methods.

02. Accelerates full mix generation by 25 to 50%.

03. Supports user-controlled, iterative stem orchestration.

Abstract

Music stem generation, the task of producing musically synchronized and isolated instrument audio clips, offers the potential of greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion-/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems in a batch, and apply a shared noise latent to each group. At inference time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized…
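
The shared-noise grouping described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `shared_group_noise`, the `group_ids` list, and `latent_dim` are hypothetical, and plain Python lists stand in for the latent tensors a real diffusion or flow model would use. Each batch element is one stem, and stems with the same group id receive an identical noise vector.

```python
import random


def shared_group_noise(group_ids, latent_dim, seed=0):
    """Return one noise vector per batch element (stem).

    Stems that share a group id (i.e. belong to the same synchronized
    set) receive the same noise values. All names are illustrative.
    """
    rng = random.Random(seed)
    noise_by_group = {}
    batch_noise = []
    for gid in group_ids:
        # Sample noise once per group, the first time the group is seen.
        if gid not in noise_by_group:
            noise_by_group[gid] = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        # Copy so each batch element owns its list but holds equal values.
        batch_noise.append(list(noise_by_group[gid]))
    return batch_noise


# Hypothetical batch: three stems from one song (group 0)
# followed by two stems from another song (group 1).
group_ids = [0, 0, 0, 1, 1]
noise = shared_group_noise(group_ids, latent_dim=4)
```

At inference time, the same idea would correspond to a single group: one shared initial noise vector paired with a different text prompt for each requested stem.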

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis