S3T: Self-Supervised Pre-training with Swin Transformer for Music Classification

Hang Zhao; Chen Zhang; Belei Zhu; Zejun Ma; Kejun Zhang

arXiv:2202.10139 · eess.AS · February 22, 2022 · 1 cites

TL;DR

S3T introduces a novel self-supervised pre-training approach using Swin Transformer and MoCo for music classification, significantly improving accuracy and label efficiency over previous methods.

Contribution

S3T is the first method to combine the Swin Transformer with self-supervised learning for music classification. It also contributes a music data augmentation pipeline and two specially designed pre-processors.

Findings

01 Outperforms the previous self-supervised method, CLMR, by 12.5% in top-1 accuracy

02 Surpasses state-of-the-art supervised methods on music tasks

03 Achieves high performance with only 10% of the data labeled

Abstract

In this paper, we propose S3T, a self-supervised pre-training method with Swin Transformer for music classification, aiming to learn meaningful music representations from massive, easily accessible unlabeled music data. S3T applies a momentum-based paradigm, MoCo, to the music time-frequency domain, with Swin Transformer as its feature extractor. For better music representation learning, S3T contributes a music data augmentation pipeline and two specially designed pre-processors. To our knowledge, S3T is the first method to combine the Swin Transformer with a self-supervised learning method for music classification. We evaluate S3T on music genre classification and music tagging tasks with linear classifiers trained on the learned representations. Experimental results show that S3T outperforms the previous self-supervised method (CLMR) by 12.5 percent top-1 accuracy and 4.8 percent PR-AUC on…
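
The momentum-based contrastive paradigm (MoCo) named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: a toy linear projection stands in for the Swin Transformer feature extractor, additive noise stands in for the music augmentation pipeline, and the names and values used here (`momentum_update`, `info_nce_loss`, `enqueue`, `m=0.999`, `temperature=0.2`, a queue of 32 keys) are assumptions chosen for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def momentum_update(query_params, key_params, m=0.999):
    """Move each key-encoder parameter slowly toward its query-encoder counterpart."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

def info_nce_loss(q, k_pos, queue, temperature=0.2):
    """Contrastive loss: each query should match its own key, not the queued keys."""
    q, k_pos, queue = l2_normalize(q), l2_normalize(k_pos), l2_normalize(queue)
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)   # (N, 1) positive logits
    l_neg = q @ queue.T                                # (N, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[:, 0].mean())               # the positive sits at index 0

def enqueue(queue, keys):
    """FIFO queue of negatives: drop the oldest keys, append the newest."""
    return np.concatenate([queue[keys.shape[0]:], keys], axis=0)

rng = np.random.default_rng(0)

# Toy stand-in for the feature extractor (assumption: a single linear map).
W_q = rng.normal(size=(16, 8))
W_k = W_q.copy()                      # the key encoder starts as a copy

# Two noisy "views" of the same four inputs stand in for augmented music clips.
x = rng.normal(size=(4, 16))
view_a = x + 0.05 * rng.normal(size=x.shape)
view_b = x + 0.05 * rng.normal(size=x.shape)

q = view_a @ W_q                      # queries from the query encoder
k = view_b @ W_k                      # keys from the momentum (key) encoder
queue = l2_normalize(rng.normal(size=(32, 8)))

loss = info_nce_loss(q, k, queue)

# Pretend an optimizer step changed the query encoder, then update the key encoder.
W_q_new = W_q + 0.1
W_k_new = momentum_update([W_q_new], [W_k])[0]

# Push the current keys into the queue for use as negatives in later steps.
queue = enqueue(queue, l2_normalize(k))
print(f"loss = {loss:.4f}, queue shape = {queue.shape}")
```

In this sketch only the query encoder would receive gradients. The key encoder changes solely through the slow moving-average update, which keeps the keys stored in the queue roughly consistent from step to step.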

Code & Models

Repositories

cgaroufis/mscol_smc23
tf

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Diverse Musicological Studies

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Adam · Stochastic Depth · Label Smoothing · Position-Wise Feed-Forward Layer · Batch Normalization · Dropout