Low-Resource Guidance for Controllable Latent Audio Diffusion

Zachary Novack; Zack Zukowski; CJ Carr; Julian Parker; Zach Evans; Josiah Taylor; Taylor Berg-Kirkpatrick; Julian McAuley; Jordi Pons

arXiv:2603.04366·cs.SD·March 5, 2026

TL;DR

This paper introduces a low-resource, guidance-based method for controllable latent audio diffusion that reduces computational costs by operating directly in latent space, enabling effective control over audio attributes with minimal training.

Contribution

The authors propose Latent-Control Heads (LatCHs), which control latent audio diffusion models without backpropagating through the decoder and require minimal training resources.

Findings

01. Effective control over audio attributes such as intensity, pitch, and beats, including combinations of them.

02. Maintains generation quality at a lower computational cost.

03. Requires minimal training resources (7M parameters and ≈ 4 hours of training).

Abstract

Generative audio requires fine-grained controllable outputs, yet most existing methods require either retraining the model on specific controls or inference-time controls (e.g., guidance) that can also be computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high cost per step due to decoder backpropagation, we introduce a guidance-based approach through selective TFG and Latent-Control Heads (LatCHs), which enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step, and require minimal training resources (7M parameters and ≈ 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and a combination of those) while maintaining generation quality. Our method…
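To make the idea of guidance "directly in latent space" concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the linear `head`, the squared-error loss, the step size, and all names are assumptions chosen for illustration, and the denoiser and the selective TFG schedule are omitted entirely. It only shows that when a control predictor reads the latent itself, the guidance gradient can be computed without passing through a decoder.

```python
# Toy illustration only: a linear "control head" stands in for a trained
# latent-space predictor. All names and values here are hypothetical.

def head(z, w, b):
    """Hypothetical control head: predicts a scalar control from latent z."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b

def guidance_grad(z, w, b, target):
    """Analytic gradient of (head(z) - target)^2 with respect to z."""
    err = head(z, w, b) - target
    return [2.0 * err * wi for wi in w]

def guided_step(z, w, b, target, step_size):
    """Nudge the latent toward the target control, entirely in latent space."""
    g = guidance_grad(z, w, b, target)
    return [zi - step_size * gi for zi, gi in zip(z, g)]

# Assumed toy head weights, latent, and target control value.
w = [0.5, -0.25, 1.0]
b = 0.1
z = [0.2, 0.4, -0.3]
target = 1.0

errors = []
for _ in range(5):
    errors.append(abs(head(z, w, b) - target))
    z = guided_step(z, w, b, target, step_size=0.1)
final_error = abs(head(z, w, b) - target)
```

In an actual sampler, a step like `guided_step` would presumably be interleaved with the diffusion model's denoising updates; here the loop runs on a fixed latent purely to show the control error shrinking.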

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies