TL;DR
LaDA-Band introduces a novel non-autoregressive diffusion model for vocal-to-accompaniment music generation, enhancing coherence, authenticity, and orchestration in full-song outputs.
Contribution
It applies Discrete Masked Diffusion to vocal-to-accompaniment generation and combines it with a dual-track architecture and curriculum training, improving the long-range coherence and fine-grained detail of generated accompaniment.
Findings
Outperforms existing methods in acoustic authenticity and coherence
Maintains high-quality accompaniment without auxiliary references
Effective on both academic and real-world benchmarks
Abstract
Vocal-to-accompaniment (V2A) generation, which aims to transform a raw vocal recording into a fully arranged accompaniment, inherently requires jointly addressing an accompaniment trilemma: preserving acoustic authenticity, maintaining global coherence with the vocal track, and producing dynamic orchestration across a full song. Existing open-source approaches typically make compromises among these goals. Continuous-latent generation models can capture long musical spans but often struggle to preserve fine-grained acoustic detail. In contrast, discrete autoregressive models retain local fidelity but suffer from unidirectional generation and error accumulation in extended contexts. We present LaDA-Band, an end-to-end framework that introduces Discrete Masked Diffusion to the V2A task. Our approach formulates V2A generation as Discrete Masked Diffusion, i.e., a global, non-autoregressive…
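The abstract contrasts left-to-right autoregressive decoding with global, non-autoregressive masked diffusion over discrete tokens. The following is a minimal sketch of that general idea, not LaDA-Band's implementation. It uses MaskGIT-style parallel decoding: training corrupts a token sequence by masking a random subset, and sampling starts from an all-masked accompaniment and reveals the most confident predictions over a fixed number of steps. The names `MASK`, `mask_ratio`, `corrupt`, `sample` and `toy_predict`, the cosine schedule, and the confidence-based reveal rule are all illustrative assumptions. How the vocal condition enters the real model (for example, through the dual-track architecture) is not described in this summary, so the predictor here simply receives the vocal tokens as an argument.

```python
import math
import random
from typing import Callable, List, Tuple

MASK = -1  # hypothetical id for the [MASK] token (assumed not to collide with real tokens)

# A predictor maps (vocal tokens, partially masked accompaniment) to a
# (predicted token, confidence) pair for every accompaniment position.
Predictor = Callable[[List[int], List[int]], List[Tuple[int, float]]]


def mask_ratio(t: float) -> float:
    """Assumed cosine schedule: fraction of tokens still masked at progress t in [0, 1]."""
    return math.cos(0.5 * math.pi * t)


def corrupt(tokens: List[int], ratio: float, rng: random.Random) -> List[int]:
    """Forward (training) process: replace a random subset of tokens with MASK.
    A model would then be trained to recover the original tokens at masked positions."""
    n = len(tokens)
    n_mask = min(n, max(1, math.ceil(ratio * n)))
    idx = set(rng.sample(range(n), n_mask))
    return [MASK if i in idx else tok for i, tok in enumerate(tokens)]


def sample(predict: Predictor, vocal: List[int], length: int, steps: int) -> List[int]:
    """Reverse process: start fully masked and, at each step, predict every position
    in parallel, then reveal the most confident masked positions."""
    seq = [MASK] * length
    for step in range(1, steps + 1):
        preds = predict(vocal, seq)
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        # How many positions should stay masked after this step (0 at the last step).
        n_keep_masked = math.floor(mask_ratio(step / steps) * length)
        n_reveal = len(masked) - n_keep_masked
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_reveal]:
            seq[i] = preds[i][0]
    return seq


def toy_predict(vocal: List[int], seq: List[int]) -> List[Tuple[int, float]]:
    """Hypothetical stand-in for a trained network: 'predicts' token (v + 1) % 8 for
    vocal token v, with higher confidence for earlier positions."""
    return [((v + 1) % 8, 1.0 / (1 + i)) for i, v in enumerate(vocal)]


vocal = [3, 7, 0, 5, 2, 6]
print(sample(toy_predict, vocal, length=len(vocal), steps=4))  # → [4, 0, 1, 6, 3, 7]
```

The number of decoding steps is fixed and does not depend on sequence length, and every prediction sees the whole partially revealed sequence rather than only a left context. This is the property the abstract sets against the unidirectional generation and error accumulation of autoregressive models.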
