Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations

Hugo Flores García; Oriol Nieto; Justin Salamon; Bryan Pardo; Prem Seetharaman

arXiv:2412.08550 · cs.SD · April 15, 2025

Open Access

TL;DR

Sketch2Sound is a lightweight, controllable audio generation model that synthesizes high-quality sounds from interpretable control signals, sonic imitations, and text prompts, enabling flexible sound creation for artists.

Contribution

It introduces a novel method combining time-varying controls and sonic imitations with a lightweight fine-tuning approach on a diffusion transformer.

Findings

01 Synthesizes sounds that follow the input control signals and vocal imitations

02 Retains high audio quality and adherence to text prompts

03 Requires only 40k fine-tuning steps and a single added linear layer per control

Abstract

We present Sketch2Sound, a generative audio model capable of creating high-quality sounds from a set of interpretable time-varying control signals: loudness, brightness, and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic imitations (i.e., a vocal imitation or a reference sound-shape). Sketch2Sound can be implemented on top of any text-to-audio latent diffusion transformer (DiT), and requires only 40k steps of fine-tuning and a single linear layer per control, making it more lightweight than existing methods like ControlNet. To synthesize from sketchlike sonic imitations, we propose applying random median filters to the control signals during training, allowing Sketch2Sound to be prompted using controls with flexible levels of temporal specificity. We show that Sketch2Sound can synthesize sounds that follow the gist of input controls from a vocal…
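The random median filtering mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `max_window` parameter, the edge padding, and the choice to draw one window size shared by all controls are assumptions made for illustration only. The idea shown is that a wider window removes more fine temporal detail from a control curve, so training on randomly chosen widths exposes the model to controls at varying levels of temporal specificity.

```python
import numpy as np


def median_filter_1d(signal, window):
    """Median-filter a 1-D signal with an odd window size.

    The signal is edge-padded so the output has the same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    signal = np.asarray(signal, dtype=float)
    if window == 1:
        return signal.copy()  # a window of 1 leaves the signal unchanged
    half = window // 2
    padded = np.pad(signal, half, mode="edge")
    # Each row of `windows` is one sliding window over the padded signal.
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    return np.median(windows, axis=-1)


def randomly_smooth_controls(controls, max_window, rng):
    """Apply a median filter of random odd width to every control signal.

    controls: dict mapping a control name (e.g. "loudness") to a 1-D array.
    Assumption: one window size is drawn per call and shared by all controls.
    """
    odd_sizes = np.arange(1, max_window + 1, 2)
    window = int(rng.choice(odd_sizes))
    smoothed = {name: median_filter_1d(values, window)
                for name, values in controls.items()}
    return smoothed, window


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = 200
    # Hypothetical control curves standing in for extracted features.
    controls = {
        "loudness": rng.normal(size=frames),
        "brightness": rng.normal(size=frames),
        "pitch": rng.normal(size=frames),
    }
    smoothed, window = randomly_smooth_controls(controls, max_window=21, rng=rng)
    print("window size:", window)
```

At inference time, a user could in principle pick the window size to decide how closely the output should track a rough vocal sketch; how the paper exposes this choice is not stated in the text above.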

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Speech and Audio Processing