Instrumental Text-to-Music Generation with Auxiliary Conditioning Branches
Junyoung Koh

TL;DR
This paper investigates the role of auxiliary conditioning branches in text-to-music diffusion models, showing in controlled ablations that removing them lowers quality; its submissions ranked first and second in the ICME 2026 challenge.
Contribution
It suggests that auxiliary branches may act as architectural anchors, improving model quality even when they receive only degenerate conditioning signals, and it reports first- and second-place results in the ICME 2026 challenge.
Findings
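Models retrained without the auxiliary lyric and timbre branches score lower on AudioBox aesthetics, LLM-as-judge, and human MOS.
Reinvesting the saved parameters as additional DiT depth recovers only a marginal part of the lost quality.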
The paper's submissions ranked first and second in the ICME 2026 challenge.
Abstract
Text-to-music generation has advanced rapidly, with modern autoregressive and diffusion-based models producing convincing music from natural-language prompts. However, much of this progress relies on large-scale training data and external pretraining, making it difficult to isolate which design choices remain effective when data and pretraining are controlled. We study this setting using a Diffusion Transformer backbone with lyric and timbre conditioning, adapted to an instrumental-only text-to-music task in which the auxiliary lyric and timbre branches receive only degenerate conditioning signals. Through controlled ablations, we find that models retrained without these branches score lower across AudioBox aesthetics, LLM-as-judge, and human MOS, and that reinvesting the saved parameters as additional DiT depth recovers only marginally. This suggests the auxiliary branches may act as…
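The parameter-matched ablation mentioned in the abstract (drop the auxiliary branches, then spend the saved parameters on extra DiT depth) can be illustrated with a rough budget calculation. This is a minimal sketch under stated assumptions, not the paper's configuration: the width, depth, branch size, and the per-layer parameter formulas below are all hypothetical and ignore biases, norms, and embeddings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    d_model: int       # hidden width (hypothetical value)
    n_blocks: int      # number of DiT blocks
    aux_branches: int  # auxiliary branches, e.g. lyric + timbre = 2
    aux_layers: int    # layers per auxiliary encoder (hypothetical)


def dit_block_params(d: int) -> int:
    # Assumed block: self-attention (4 d^2) + cross-attention (4 d^2)
    # + MLP with 4x expansion (8 d^2).
    return 16 * d * d


def aux_layer_params(d: int) -> int:
    # Assumed auxiliary encoder layer: self-attention (4 d^2) + MLP (8 d^2).
    return 12 * d * d


def aux_params(cfg: ModelConfig) -> int:
    return cfg.aux_branches * cfg.aux_layers * aux_layer_params(cfg.d_model)


def total_params(cfg: ModelConfig) -> int:
    return cfg.n_blocks * dit_block_params(cfg.d_model) + aux_params(cfg)


def reinvest_as_depth(cfg: ModelConfig) -> ModelConfig:
    """Remove the auxiliary branches and add as many whole DiT blocks
    as the saved parameters can pay for."""
    extra_blocks = aux_params(cfg) // dit_block_params(cfg.d_model)
    return ModelConfig(cfg.d_model, cfg.n_blocks + extra_blocks, 0, cfg.aux_layers)


baseline = ModelConfig(d_model=1024, n_blocks=24, aux_branches=2, aux_layers=4)
deeper = reinvest_as_depth(baseline)
print(deeper.n_blocks, total_params(baseline), total_params(deeper))
```

With these illustrative numbers the two variants have the same parameter count, so any quality gap between them would be attributable to where the parameters sit rather than how many there are, which is the comparison the ablation is designed to make.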
