Spectrograms Are Sequences of Patches
Leyi Zhao, Yi Li

TL;DR
This paper introduces a self-supervised learning approach treating music spectrograms as sequences of patches, leveraging NLP and CV techniques to improve audio representations without labeled data.
Contribution
The work proposes Patchifier, a patch-based self-supervised model for music spectrograms, and shows that modeling a spectrogram as a sequence of patches is effective for audio tasks.
Findings
The model achieves acceptable results on several downstream tasks compared with other audio representation models.
Treating spectrograms as patch sequences is effective.
Self-supervised learning reduces reliance on labeled data.
Abstract
Self-supervised pre-training has been used successfully in several machine learning domains, but little of that work addresses music. We treat a music spectrogram as a series of patches and design Patchifier, a self-supervised model that captures the features of these sequential patches and draws on self-supervised learning methods from both NLP and CV. Pre-training uses no labeled data, only a subset of the MTAT dataset containing 16k music clips. After pre-training, we apply the model to several downstream tasks, where it achieves acceptable results compared with other audio representation models. Our work also shows that it makes sense to treat audio as a series of patch segments.
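To make the central idea of the abstract concrete, the sketch below cuts a 2-D spectrogram into non-overlapping patches and flattens them into a sequence. It is only an illustration, not the authors' Patchifier: the function name spectrogram_to_patches, the 16x16 patch size, the time-major ordering, and the random array standing in for a real spectrogram are all assumptions, since the abstract does not specify any of them.

```python
import numpy as np


def spectrogram_to_patches(spec: np.ndarray, patch_h: int, patch_w: int) -> np.ndarray:
    """Split a (freq, time) spectrogram into a sequence of flattened patches.

    Hypothetical helper for illustration only. Patches do not overlap and
    are ordered time-major: every frequency patch of the first time slice
    comes first, then the second time slice, and so on.
    """
    n_freq, n_time = spec.shape
    rows = n_freq // patch_h   # number of patches along the frequency axis
    cols = n_time // patch_w   # number of patches along the time axis

    # Drop trailing bins/frames that do not fill a whole patch.
    spec = spec[: rows * patch_h, : cols * patch_w]

    # (rows, patch_h, cols, patch_w) -> (cols, rows, patch_h, patch_w)
    patches = spec.reshape(rows, patch_h, cols, patch_w).transpose(2, 0, 1, 3)

    # One row per patch, each patch flattened into a vector.
    return patches.reshape(rows * cols, patch_h * patch_w)


# Stand-in for a real spectrogram: 96 frequency bins x 64 time frames.
rng = np.random.default_rng(0)
spec = rng.random((96, 64))

sequence = spectrogram_to_patches(spec, patch_h=16, patch_w=16)
print(sequence.shape)  # (24, 256): 6 x 4 patches, each with 16 * 16 values
```

A sequence in this form can be passed to any model that consumes token-like inputs, which is what lets sequence-oriented self-supervised methods from NLP and CV be applied to audio.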
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
