Linear Complexity Self-Supervised Learning for Music Understanding with Random Quantizer
Petros Vavaroutsos, Theodoros Palamas, Pantelis Vikatos

TL;DR
This paper introduces a lightweight self-supervised learning approach for music understanding that combines the Branchformer architecture with SummaryMixing and a random quantization process, reducing model size while maintaining competitive performance.
Contribution
It proposes a new combination of architecture and quantization techniques for efficient music information retrieval models, with extensive reproducible experiments.
Findings
Achieves competitive MIR performance with reduced model size
Reduces model size by up to 12.3% compared to state-of-the-art
Demonstrates effectiveness on diverse downstream MIR tasks
Abstract
In recent years, foundation models have become very popular due to their exceptional performance, mainly in natural language processing (NLP) tasks, where they were first introduced. These models usually consist of hundreds of millions, or even billions, of parameters, which makes them resource-intensive both during training and in production systems and leads to increased costs. This paper focuses on reducing the size of a foundation model when it is applied to music information retrieval (MIR) tasks. Our research combines the Branchformer architecture with SummaryMixing, both of which were first applied in speech recognition, along with a random quantization process. To facilitate reproducibility, we conduct pre-training on publicly available datasets, complemented by a proprietary dataset comparable in scale to other private datasets reported in the literature. We ensure robust evaluation by using a framework…
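The abstract does not spell out how the random quantization process works. As a rough illustration only, the sketch below assumes the common random-projection scheme: a frozen random matrix projects each input frame, and the index of the nearest entry in a frozen random codebook becomes the discrete target that the encoder learns to predict for masked frames. The class name `RandomQuantizer`, the dimensions, and the seeds are hypothetical and are not taken from the paper.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class RandomQuantizer:
    """Frozen random projection plus frozen random codebook (assumed scheme).

    Neither table is ever trained; both are drawn once from a seeded RNG.
    """

    def __init__(self, input_dim, code_dim, num_codes, seed=0):
        rng = random.Random(seed)
        # Hypothetical sizes: project input_dim features down to code_dim.
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(input_dim)]
            for _ in range(code_dim)
        ]
        # Codebook entries are unit-normalized random vectors.
        self.codebook = [
            l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
            for _ in range(num_codes)
        ]

    def quantize(self, frame):
        """Return the index of the codebook entry nearest to the projected frame."""
        projected = l2_normalize(
            [sum(w * x for w, x in zip(row, frame)) for row in self.projection]
        )

        def sq_dist(code):
            return sum((p - c) ** 2 for p, c in zip(projected, code))

        return min(range(len(self.codebook)), key=lambda i: sq_dist(self.codebook[i]))


# Toy usage: five random 8-dimensional "frames" mapped to discrete targets.
quantizer = RandomQuantizer(input_dim=8, code_dim=4, num_codes=16, seed=0)
frame_rng = random.Random(1)
frames = [[frame_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
targets = [quantizer.quantize(frame) for frame in frames]
```

Because the projection and codebook stay fixed, a quantizer of this kind adds no trainable parameters, which fits the paper's stated goal of reducing model size. The exact configuration the authors use may differ from this sketch.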
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Topic Modeling
