Creative Text-to-Audio Generation via Synthesizer Programming
Manuel Cherep, Nikhil Singh, Jessica Shand

TL;DR
This paper introduces CTAG, a text-to-audio generation method that programs a virtual modular synthesizer with only 78 parameters. Unlike neural models with up to billions of uninterpretable parameters, it produces sounds from text prompts that can be easily inspected and edited.
Contribution
The paper presents a new method that leverages a simple, interpretable synthesizer for text-to-audio generation, allowing easy inspection and tweaking of generated sounds.
Findings
Produces high-quality, distinctive sounds from text prompts.
Allows for easy inspection and editing of sound parameters.
Generates abstract audio that captures essential conceptual features over fine-grained acoustic details.
Abstract
Neural audio synthesis methods now allow specifying ideas in natural language. However, these methods produce results that cannot be easily tweaked, as they are based on large latent spaces and up to billions of uninterpretable parameters. We propose a text-to-audio generation method that leverages a virtual modular sound synthesizer with only 78 parameters. Synthesizers have long been used by skilled sound designers for media like music and film due to their flexibility and intuitive controls. Our method, CTAG, iteratively updates a synthesizer's parameters to produce high-quality audio renderings of text prompts that can be easily inspected and tweaked. Sounds produced this way are also more abstract, capturing essential conceptual features over fine-grained acoustic details, akin to how simple sketches can vividly convey visual concepts. Our results show how CTAG produces sounds that…
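The iterative parameter update described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two-parameter `render` function stands in for the 78-parameter synthesizer, `score` is a placeholder for whatever text-audio similarity measure the method uses (here it simply compares against a target waveform), and the random-search loop in `optimize` is a generic hill climber, not necessarily the optimizer used by CTAG.

```python
import math
import random

SAMPLE_RATE = 8000  # Hz; kept low so the toy example runs quickly
DURATION = 0.25     # seconds


def render(params):
    """Toy 'synthesizer' (hypothetical): a sine wave with exponential decay.

    params = (frequency_hz, decay_rate). These two knobs stand in for the
    78 parameters of the synthesizer described in the abstract.
    """
    freq, decay = params
    n = int(SAMPLE_RATE * DURATION)
    return [
        math.exp(-decay * t / SAMPLE_RATE)
        * math.sin(2 * math.pi * freq * t / SAMPLE_RATE)
        for t in range(n)
    ]


def score(audio, target):
    """Placeholder objective (assumption): negative mean squared error.

    A text-to-audio system would instead score audio against the text
    prompt; a target waveform is used here only to keep the sketch
    self-contained. Higher is better.
    """
    return -sum((a - b) ** 2 for a, b in zip(audio, target)) / len(target)


def clamp(x, lo, hi):
    return min(max(x, lo), hi)


def optimize(target, steps=300, seed=0):
    """Random-search hill climbing over the synthesizer parameters."""
    rng = random.Random(seed)
    best = (rng.uniform(100.0, 1000.0), rng.uniform(0.0, 20.0))
    best_score = score(render(best), target)
    history = [best_score]
    for _ in range(steps):
        # Propose a small random change to every parameter.
        cand = (
            clamp(best[0] + rng.gauss(0.0, 30.0), 20.0, 2000.0),
            clamp(best[1] + rng.gauss(0.0, 1.0), 0.0, 50.0),
        )
        cand_score = score(render(cand), target)
        # Keep the candidate only if it improves the objective.
        if cand_score > best_score:
            best, best_score = cand, cand_score
        history.append(best_score)
    return best, history


if __name__ == "__main__":
    target = render((440.0, 5.0))
    params, history = optimize(target)
    print("frequency_hz, decay_rate:", params)
    print("score improved from", history[0], "to", history[-1])
```

The result of the search is a short tuple of named, human-readable values rather than a latent vector, which is the property the abstract highlights: a sound designer could read the final parameters and adjust any of them by hand.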
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Artificial Intelligence in Games
