Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations
Hugo Flores García, Oriol Nieto, Justin Salamon, Bryan Pardo, Prem Seetharaman

TL;DR
Sketch2Sound is a lightweight, controllable audio generation model that synthesizes high-quality sounds from interpretable control signals, sonic imitations, and text prompts, enabling flexible sound creation for artists.
Contribution
The paper introduces a method that conditions a text-to-audio diffusion transformer on time-varying control signals and sonic imitations through lightweight fine-tuning.
Findings
Synthesizes sounds that follow input control signals and vocal imitations
Retains high audio quality and adherence to text prompts
Requires only 40k fine-tuning steps and a single linear layer per control
Abstract
We present Sketch2Sound, a generative audio model capable of creating high-quality sounds from a set of interpretable time-varying control signals: loudness, brightness, and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic imitations (i.e., a vocal imitation or a reference sound-shape). Sketch2Sound can be implemented on top of any text-to-audio latent diffusion transformer (DiT), and requires only 40k steps of fine-tuning and a single linear layer per control, making it more lightweight than existing methods like ControlNet. To synthesize from sketch-like sonic imitations, we propose applying random median filters to the control signals during training, allowing Sketch2Sound to be prompted using controls with flexible levels of temporal specificity. We show that Sketch2Sound can synthesize sounds that follow the gist of input controls from a vocal…
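The random median filtering mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the edge-padding scheme, the toy loudness values, and the window-size range (`max_window`) are all assumptions made for illustration, since the abstract does not specify them. The idea shown is only that a control curve is smoothed with a median filter whose window is drawn at random, so that a model sees controls at varying levels of temporal detail.

```python
import random
from statistics import median


def median_filter(signal, window):
    """Median-filter a non-empty 1-D sequence with an odd window.

    Edges are handled by repeating the first and last values (an assumption).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(signal))]


def random_median_filter(signal, max_window=9, rng=random):
    """Smooth a control curve with a randomly chosen odd window size.

    max_window is a hypothetical value; the source does not state the
    window sizes used. A window of 1 leaves the curve unchanged.
    """
    window = rng.choice(range(1, max_window + 1, 2))
    return median_filter(signal, window), window


# Toy frame-level loudness curve with a one-frame spike (made-up values).
loudness = [0.1, 0.1, 0.9, 0.1, 0.1, 0.2, 0.3, 0.4]
smoothed = median_filter(loudness, 3)
print(smoothed)
```

With a window of 3 the one-frame spike is removed while the slower rise at the end of the curve is kept. Larger windows discard progressively more temporal detail, which is the sense in which a filtered control behaves like a rough sketch of the original.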
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Speech and Audio Processing
