Pre-training with Synthetic Patterns for Audio
Yuchi Ishikawa, Tatsuya Komatsu, Yoshimitsu Aoki

TL;DR
This paper introduces a novel pre-training method for audio encoders using synthetic patterns and Masked Autoencoders, enabling effective learning without real audio data and addressing privacy concerns.
Contribution
The paper presents a framework that combines MAEs with synthetic data to pre-train audio models without relying on real audio datasets.
Findings
Achieves performance comparable to models trained on large real audio datasets
Partially outperforms image-based pre-training methods
Evaluated across 13 audio tasks and 17 synthetic datasets
Abstract
In this paper, we propose to pre-train audio encoders using synthetic patterns instead of real audio data. Our proposed framework consists of two key elements. The first is the Masked Autoencoder (MAE), a self-supervised learning framework that learns by reconstructing data from randomly masked counterparts. MAEs tend to focus on low-level information such as visual patterns and regularities within data. Therefore, it matters little what the input portrays, whether it be images, audio mel-spectrograms, or even synthetic patterns. This leads to the second key element, synthetic data. Synthetic data, unlike real audio, is free from privacy and licensing infringement issues. By combining MAEs and synthetic patterns, our framework enables the model to learn generalized feature representations without real data, while addressing the issues related to real audio. To evaluate…
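The masking step the abstract describes can be illustrated with a minimal sketch. Everything in it is an assumption for illustration rather than the authors' pipeline: the sinusoidal grating stands in for a synthetic pattern (the paper's 17 synthetic datasets are not specified here), the 16x16 patch size and 0.75 masking ratio are common MAE defaults rather than values confirmed by this page, and the "predictor" is a placeholder that averages the visible patches instead of a trained encoder and decoder. All function names are hypothetical.

```python
import math
import random


def synthetic_pattern(height=64, width=64, freq=4.0, angle=0.5):
    """Hypothetical stand-in pattern: a 2D sinusoidal grating with values in [0, 1]."""
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            phase = 2 * math.pi * freq * (x * math.cos(angle) + y * math.sin(angle)) / width
            row.append(0.5 * (1.0 + math.sin(phase)))
        rows.append(row)
    return rows


def patchify(image, patch=16):
    """Split a 2D list into non-overlapping patches, each flattened to a list."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[y][x]
                            for y in range(top, top + patch)
                            for x in range(left, left + patch)])
    return patches


def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Return a list of booleans; True marks a patch hidden from the encoder."""
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only, as in MAE-style training."""
    errors = [(p - t) ** 2
              for pred_patch, target_patch, m in zip(pred, target, mask) if m
              for p, t in zip(pred_patch, target_patch)]
    return sum(errors) / len(errors)


patches = patchify(synthetic_pattern())
mask = random_mask(len(patches))
visible = [p for p, m in zip(patches, mask) if not m]

# Placeholder predictor (assumption): a real MAE would encode the visible
# patches and decode the masked ones; here every patch is predicted as the
# per-pixel mean of the visible patches.
mean_patch = [sum(column) / len(visible) for column in zip(*visible)]
pred = [mean_patch for _ in patches]
loss = masked_mse(pred, patches, mask)
```

The loss is computed only on the hidden patches, which is what pushes a real model to infer missing structure from the visible context. Because that objective depends on local regularities rather than semantic content, the same code path applies whether the input is an image, a mel-spectrogram, or a synthetic pattern.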
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies
Methods: Focus
