Evaluating Pretrained General-Purpose Audio Representations for Music Genre Classification
Kashish Rai, Mrinmoy Bhattacharjee

TL;DR
This paper evaluates self-supervised audio embeddings, especially BYOL-A, for music genre classification. It shows that BYOL-A outperforms PANNs and VGGish embeddings, and it explores training strategies and joint training on combined datasets.
Contribution
It introduces the use of BYOL-A embeddings with a deep neural network for improved music genre classification and investigates training techniques and dataset combination effects.
Findings
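BYOL-A embeddings outperform PANNs and VGGish, reaching 81.5% accuracy on GTZAN and 64.3% on FMA-Small.
Deep neural network classifiers outperform linear classifiers by 10-16%.
Joint training on a unified 18-class label space (GTZAN plus FMA-Small) slightly lowers accuracy on GTZAN and gives comparable results on FMA-Small.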
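The unified label space used for joint training can be pictured with a small sketch. It assumes that the 10 GTZAN genres and the 8 FMA-Small genres are kept as separate classes (10 + 8 = 18), even where genre names overlap. The summary does not state how the authors built the mapping, so the function name and the dataset-prefixed keys below are hypothetical.

```python
# Genre names as distributed with each dataset.
GTZAN_GENRES = ["blues", "classical", "country", "disco", "hiphop",
                "jazz", "metal", "pop", "reggae", "rock"]
FMA_SMALL_GENRES = ["Electronic", "Experimental", "Folk", "Hip-Hop",
                    "Instrumental", "International", "Pop", "Rock"]


def build_unified_label_space(datasets):
    """Map each (dataset, genre) pair to a contiguous class index.

    Assumption: overlapping genres (e.g. GTZAN "rock" and FMA-Small "Rock")
    are NOT merged, which is what yields 18 classes in total.
    """
    label_to_index = {}
    for dataset_name, genres in datasets:
        for genre in genres:
            label_to_index[(dataset_name, genre)] = len(label_to_index)
    return label_to_index


UNIFIED = build_unified_label_space([
    ("gtzan", GTZAN_GENRES),
    ("fma_small", FMA_SMALL_GENRES),
])
```

Under this assumption a single classifier head has 18 outputs, and a GTZAN "rock" clip and an FMA-Small "Rock" clip receive different targets. Merging the overlapping genres instead would give fewer than 18 classes.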
Abstract
This study investigates the use of self-supervised learning embeddings, particularly BYOL-A, in conjunction with a deep neural network (DNN) classifier for Music Genre Classification. Our experiments demonstrate that BYOL-A embeddings outperform other pre-trained models, such as PANNs and VGGish, achieving an accuracy of 81.5% on the GTZAN dataset and 64.3% on FMA-Small. The proposed DNN classifier improved performance by 10-16% over linear classifiers. We explore the effects of contrastive and triplet losses, and of multitask training with optimized loss weights, which achieves the highest accuracy. To address cross-dataset challenges, we combined GTZAN and FMA-Small into a unified 18-class label space for joint training, resulting in a slight performance drop on GTZAN but comparable results on FMA-Small. The scripts developed in this work are publicly available.
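Multitask training with loss weights, as mentioned in the abstract, can be illustrated as a weighted sum of a classification loss and a metric-learning loss. The sketch below assumes a cross-entropy term plus a standard triplet margin term. The abstract gives neither the exact formulation nor the weight values, so the function names, the default weights, and the margin are placeholders rather than the authors' settings.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one example, via a numerically stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: pull the positive closer than the negative."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def multitask_loss(logits, target, anchor, positive, negative,
                   ce_weight=1.0, triplet_weight=0.5, margin=1.0):
    """Weighted sum of the two terms; the weights here are illustrative only."""
    return (ce_weight * cross_entropy(logits, target)
            + triplet_weight * triplet_loss(anchor, positive, negative, margin))
```

In a setup like this the two weights are hyperparameters, typically tuned on a validation split. The contrastive loss mentioned in the abstract could be substituted for the triplet term in the same way.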
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition
