Synthesizer Sound Matching Using Audio Spectrogram Transformers
Fred Bruford, Frederik Blang, and Shahan Nercessian

TL;DR
This paper presents an Audio Spectrogram Transformer-based model for synthesizer sound matching. The model reconstructs synthesizer parameters more accurately than MLP and CNN baselines and can also emulate out-of-domain inputs such as vocal imitations and sounds from other synthesizers.
Contribution
Introduces a transformer-based sound matching model for synthesizers that generalizes across different sounds and synthesizer types, improving fidelity over traditional neural network approaches.
Findings
The model outperforms MLP and CNN baselines in parameter reconstruction.
It can emulate vocal imitations and sounds from other synthesizers.
It remains robust when matching out-of-domain sounds.
Abstract
Systems for synthesizer sound matching, which automatically set the parameters of a synthesizer to emulate an input sound, have the potential to make the process of synthesizer programming faster and easier for novice and experienced musicians alike, whilst also affording new means of interaction with synthesizers. Considering the enormous variety of synthesizers in the marketplace, and the complexity of many of them, general-purpose sound matching systems that function with minimal knowledge or prior assumptions about the underlying synthesis architecture are particularly desirable. With this in mind, we introduce a synthesizer sound matching model based on the Audio Spectrogram Transformer. We demonstrate the viability of this model by training on a large synthetic dataset of randomly generated samples from the popular Massive synthesizer. We show that this model can reconstruct…
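The synthetic-data setup described in the abstract (sample random parameters, render them to audio, train a model to recover the parameters) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_synth` is a hypothetical three-parameter decaying sine oscillator standing in for a real synthesizer such as Massive, and `parameter_mae` is one plausible reconstruction metric, not necessarily the one used in the paper. A real pipeline would also convert each audio clip to a spectrogram before passing it to the model, which is omitted here.

```python
import math
import random

SAMPLE_RATE = 16_000  # Hz, assumed for this toy example


def toy_synth(params, duration=0.25):
    """Render a decaying sine tone from normalized parameters in [0, 1].

    Hypothetical stand-in for a real synthesizer: params = (pitch, decay, gain).
    """
    pitch, decay, gain = params
    freq = 110.0 * 2.0 ** (3.0 * pitch)  # map [0, 1] to 110..880 Hz
    rate = 1.0 + 20.0 * decay            # exponential decay rate in 1/s
    n = int(SAMPLE_RATE * duration)
    return [
        gain
        * math.exp(-rate * t / SAMPLE_RATE)
        * math.sin(2.0 * math.pi * freq * t / SAMPLE_RATE)
        for t in range(n)
    ]


def make_dataset(num_examples, seed=0):
    """Sample random parameter vectors and render each one to audio."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_examples):
        params = tuple(rng.random() for _ in range(3))
        dataset.append((toy_synth(params), params))
    return dataset


def parameter_mae(predicted, target):
    """Mean absolute error between two parameter vectors."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Each example pairs rendered audio with the parameters that produced it.
dataset = make_dataset(4)
audio, params = dataset[0]
```

Because the training targets are the parameters that generated each sound, no manual labeling is needed. A sound matching model would be trained to map `audio` (or its spectrogram) back to `params`, and `parameter_mae` could then score its predictions on held-out examples.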
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: Sparse Evolutionary Training · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Softmax · Attention Is All You Need · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Multi-Head Attention
