LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation
Tom Baker, Javier Nistal

TL;DR
LiLAC introduces a lightweight, modular control architecture for musical audio generation that maintains high quality and flexibility while significantly reducing memory requirements compared to existing ControlNet models.
Contribution
The paper presents a novel, lightweight architecture for controlling music generation that is more efficient and flexible than traditional ControlNet approaches.
Findings
Achieves comparable audio quality to ControlNet
Reduces parameter count and memory usage significantly
Enables flexible, independent control of musical features
Abstract
Text-to-audio diffusion models produce high-quality and diverse music but many, if not most, of the SOTA models lack the fine-grained, time-varying controls essential for music production. ControlNet enables attaching external controls to a pre-trained generative model by cloning and fine-tuning its encoder on new conditionings. However, this approach incurs a large memory footprint and restricts users to a fixed set of controls. We propose a lightweight, modular architecture that considerably reduces parameter count while matching ControlNet in audio quality and condition adherence. Our method offers greater flexibility and significantly lower memory usage, enabling more efficient training and deployment of independent controls. We conduct extensive objective and subjective evaluations and provide numerous audio examples on the accompanying website at…
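The ControlNet mechanism the abstract refers to (cloning the pre-trained encoder and fine-tuning the clone on a new conditioning) can be illustrated with a toy, pure-Python sketch. This is not LiLAC's architecture, which the abstract does not detail, and it is not the authors' code. The names `TinyEncoder`, `ControlBranch`, and `controlled_forward` are hypothetical, the "encoder" is a single linear layer, and the zero-initialised per-channel gain stands in for ControlNet's zero-initialised projection layers. The sketch shows two things under those assumptions: each control adds a full copy of the encoder's parameters, and the frozen model's output is unchanged before any fine-tuning.

```python
import copy


class TinyEncoder:
    """Hypothetical stand-in for a pre-trained encoder: one linear layer y = W x."""

    def __init__(self, weights):
        self.weights = [row[:] for row in weights]

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]

    def num_params(self):
        return sum(len(row) for row in self.weights)


class ControlBranch:
    """ControlNet-style branch: a trainable clone of the encoder plus a
    zero-initialised output gain (assumed stand-in for a zero projection)."""

    def __init__(self, frozen_encoder):
        self.encoder = copy.deepcopy(frozen_encoder)  # full clone, trainable
        self.zero_gain = [0.0] * len(frozen_encoder.weights)

    def __call__(self, x, control):
        # Inject the control signal into the clone's input.
        h = self.encoder([xi + ci for xi, ci in zip(x, control)])
        return [g * hi for g, hi in zip(self.zero_gain, h)]

    def num_params(self):
        return self.encoder.num_params() + len(self.zero_gain)


def controlled_forward(frozen, branches, x, controls):
    """Frozen encoder output plus the residual from every control branch."""
    out = frozen(x)
    for branch, control in zip(branches, controls):
        residual = branch(x, control)
        out = [o + r for o, r in zip(out, residual)]
    return out


frozen = TinyEncoder([[1.0, 2.0], [0.5, -1.0]])
branches = [ControlBranch(frozen) for _ in range(3)]  # three independent controls
x = [1.0, 1.0]
controls = [[0.1, 0.2], [0.3, 0.0], [0.0, 0.5]]

baseline = frozen(x)
controlled = controlled_forward(frozen, branches, x, controls)
added_params = sum(b.num_params() for b in branches)

print("frozen encoder params:", frozen.num_params())
print("params added by 3 controls:", added_params)
print("output unchanged at init:", controlled == baseline)
```

In this toy setting the added parameter count grows linearly with the number of controls, each contributing at least one encoder's worth of weights, which is the memory cost the abstract says a lighter design aims to reduce.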
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis
Methods: Diffusion · Sparse Evolutionary Training
