Contrastive Learning from Synthetic Audio Doppelgängers
Manuel Cherep, Nikhil Singh

TL;DR
This paper introduces a contrastive learning method using synthetic audio generated by perturbing sound synthesizer parameters, which improves audio representation quality and reduces data requirements.
Contribution
The paper uses synthetic audio pairs as positives for contrastive learning; the approach outperforms methods trained on real data and requires few hyperparameters.
Findings
Synthetic audio pairs enhance contrastive learning effectiveness.
Method outperforms real data-based approaches on standard tasks.
Approach is lightweight with no data storage needs.
Abstract
Learning robust audio representations currently demands extensive datasets of real-world sound recordings. By applying artificial transformations to these recordings, models can learn to recognize similarities despite subtle variations through techniques like contrastive learning. However, these transformations are only approximations of the true diversity found in real-world sounds, which are generated by complex interactions of physical processes, from vocal cord vibrations to the resonance of musical instruments. We propose a solution to both the data scale and transformation limitations, leveraging synthetic audio. By randomly perturbing the parameters of a sound synthesizer, we generate audio doppelgängers: synthetic positive pairs with causally manipulated variations in timbre, pitch, and temporal envelopes. These variations, difficult to achieve through augmentations of existing…
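The idea of generating a positive pair by perturbing synthesizer parameters can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy three-parameter synthesizer (`render`), the uniform parameter sampling, the Gaussian perturbation scale `delta`, and the InfoNCE-style loss (`info_nce`) are hypothetical stand-ins for whatever synthesizer, perturbation scheme, and contrastive objective the authors actually use.

```python
import numpy as np

SR = 16000   # sample rate in Hz (assumed for this sketch)
DUR = 0.5    # clip duration in seconds (assumed)

def render(params, sr=SR, dur=DUR):
    """Toy synthesizer (hypothetical): a harmonic tone with a decaying envelope.

    params holds three values in [0, 1]: pitch, brightness, decay.
    """
    pitch, brightness, decay = params
    t = np.arange(int(sr * dur)) / sr
    f0 = 110.0 * 2.0 ** (3.0 * pitch)           # maps [0, 1] to 110..880 Hz
    envelope = np.exp(-t * (1.0 + 20.0 * decay))
    wave = np.zeros_like(t)
    for k in range(1, 6):                        # five harmonics, all below Nyquist
        wave += brightness ** (k - 1) * np.sin(2.0 * np.pi * k * f0 * t)
    wave *= envelope
    return wave / (np.max(np.abs(wave)) + 1e-9)  # peak-normalize

def make_pair(rng, delta=0.05):
    """Sample random parameters, perturb them slightly, and render both sounds."""
    theta = rng.uniform(0.0, 1.0, size=3)
    theta_pos = np.clip(theta + rng.normal(0.0, delta, size=3), 0.0, 1.0)
    return render(theta), render(theta_pos)

def info_nce(za, zb, temperature=0.1):
    """InfoNCE-style loss: row i of za should match row i of zb."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
anchor, positive = make_pair(rng)   # two related but non-identical clips
```

In a training loop along these lines, each batch would be rendered on the fly with `make_pair`, passed through an encoder, and scored with `info_nce`, so no audio would need to be stored. A larger `delta` would make the two clips of a pair less alike and the matching task harder.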
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Adaptive Filtering Techniques
Methods: Contrastive Learning
