S3T: Self-Supervised Pre-training with Swin Transformer for Music Classification
Hang Zhao, Chen Zhang, Belei Zhu, Zejun Ma, Kejun Zhang

TL;DR
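S3T is a self-supervised pre-training approach for music classification that pairs a Swin Transformer feature extractor with the MoCo momentum-based paradigm. It outperforms the earlier self-supervised method CLMR by 12.5% in top-1 accuracy and remains strong when only 10% of the labeled data is used.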
Contribution
To the authors' knowledge, S3T is the first method to combine the Swin Transformer with self-supervised learning for music classification. It also contributes a music data augmentation pipeline and two specially designed pre-processors.
Findings
Outperforms the previous self-supervised method CLMR by 12.5% in top-1 accuracy
Surpasses state-of-the-art supervised methods on music tasks
Achieves high performance with only 10% of the labeled data
Abstract
In this paper, we propose S3T, a self-supervised pre-training method with Swin Transformer for music classification, aiming to learn meaningful music representations from massive, easily accessible unlabeled music data. S3T applies a momentum-based paradigm, MoCo, with Swin Transformer as its feature extractor, to the music time-frequency domain. To learn better music representations, S3T contributes a music data augmentation pipeline and two specially designed pre-processors. To our knowledge, S3T is the first method combining the Swin Transformer with a self-supervised learning method for music classification. We evaluate S3T on music genre classification and music tagging tasks with linear classifiers trained on the learned representations. Experimental results show that S3T outperforms the previous self-supervised method (CLMR) by 12.5 percent in top-1 accuracy and 4.8 percent in PR-AUC on…
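The momentum-based paradigm mentioned in the abstract can be illustrated with a minimal NumPy sketch of the two core MoCo ingredients: the exponential-moving-average update of the key encoder and the InfoNCE contrastive loss over a queue of negatives. This is a generic illustration, not the authors' implementation. The function names, the momentum value `m=0.999`, and the temperature `0.2` are assumptions chosen for the example, and plain arrays stand in for the embeddings a Swin Transformer would produce from two augmented views of a spectrogram.

```python
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    """Return key-encoder parameters moved toward the query encoder by an
    exponential moving average. `m` is a hypothetical default."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce_loss(q, k_pos, queue, temperature=0.2):
    """InfoNCE loss with one positive key per query and a queue of negatives.

    q:      (N, D) query embeddings (stand-in for encoder output on view 1)
    k_pos:  (N, D) key embeddings of the matching view 2
    queue:  (K, D) embeddings of earlier samples, used as negatives
    `temperature` is an assumed value for illustration.
    """
    q, k_pos, queue = l2_normalize(q), l2_normalize(k_pos), l2_normalize(queue)
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)   # (N, 1) positive logits
    l_neg = q @ queue.T                                # (N, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive key always sits at index 0 of each row.
    return float(-log_prob[:, 0].mean())

# Toy usage: a query that matches its positive key gives a small loss,
# while a query that matches a negative in the queue gives a large one.
q = np.array([[1.0, 0.0]])
loss_matched = info_nce_loss(q, np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))
loss_mismatched = info_nce_loss(q, np.array([[0.0, 1.0]]), np.array([[1.0, 0.0]]))
print(loss_matched < loss_mismatched)
```

In a full training loop, only the query encoder would receive gradients from this loss; the key encoder would be refreshed with `momentum_update` after each step, and the newest keys would replace the oldest entries in the queue.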
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Diverse Musicological Studies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Adam · Stochastic Depth · Label Smoothing · Position-Wise Feed-Forward Layer · Batch Normalization · Dropout
