TimbreCLIP: Connecting Timbre to Text and Images
Nicolas Jonason, Bob L.T. Sturm

TL;DR
TimbreCLIP introduces a cross-modal embedding that connects musical instrument timbre to text and images, enabling applications such as text-driven audio equalization and timbre-to-image synthesis.
Contribution
The paper presents an audio-text embedding trained on single instrument notes and demonstrates its use in cross-modal retrieval and in creative audio-visual tasks.
Findings
Effective cross-modal retrieval on synth patches
Successful application in text-driven audio equalization
Timbre-to-image generation demonstrated
Abstract
We present work in progress on TimbreCLIP, an audio-text cross-modal embedding trained on single instrument notes. We evaluate the models with a cross-modal retrieval task on synth patches. Finally, we demonstrate the application of TimbreCLIP on two tasks: text-driven audio equalization and timbre-to-image generation.
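To make the cross-modal retrieval evaluation mentioned in the abstract concrete, here is a minimal sketch of how retrieval in a shared audio-text embedding space is commonly scored. It is not the authors' implementation: the embedding vectors are hand-written toy values standing in for encoder outputs, and the helper names (`cosine_similarity`, `retrieve`, `recall_at_k`) as well as the choice of recall@k as the metric are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, candidates, k=1):
    """Return the indices of the k candidates most similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def recall_at_k(queries, candidates, k=1):
    """Fraction of queries whose paired candidate (same index) is in the top k."""
    hits = sum(1 for i, q in enumerate(queries) if i in retrieve(q, candidates, k))
    return hits / len(queries)


# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
# Row i of each list is assumed to describe the same synth patch.
text_embeddings = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
audio_embeddings = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.2, 0.1, 0.7]]

# Text-to-audio retrieval: which audio clip best matches each text description?
top1 = [retrieve(t, audio_embeddings, k=1)[0] for t in text_embeddings]
score = recall_at_k(text_embeddings, audio_embeddings, k=1)
```

In a real evaluation the two lists would come from the trained text and audio encoders applied to patch descriptions and rendered patch audio, and the same functions could be called with the arguments swapped to score audio-to-text retrieval.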
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
