Linear Complexity Self-Supervised Learning for Music Understanding with Random Quantizer

Petros Vavaroutsos; Theodoros Palamas; Pantelis Vikatos

arXiv:2601.09603 · cs.SD · January 15, 2026

Open Access

TL;DR

This paper introduces a lightweight self-supervised learning approach for music understanding that combines the Branchformer architecture with SummaryMixing and a random quantization process, reducing model size while maintaining competitive performance.

Contribution

It proposes a new combination of architecture and quantization techniques for efficient music information retrieval models, with extensive reproducible experiments.

Findings

01. Achieves competitive MIR performance with reduced model size

02. Reduces model size by up to 12.3% compared to state-of-the-art

03. Demonstrates effectiveness on diverse downstream MIR tasks

Abstract

In recent years, foundation models have become very popular due to their exceptional performance, mainly in natural language processing (NLP) tasks, where they were first introduced. These models usually consist of hundreds of millions, or even billions, of parameters, which makes them resource-intensive both during training and in production systems and leads to increased costs. This paper focuses on reducing the size of a foundation model when it is applied to music information retrieval (MIR) tasks. Our research combines the Branchformer architecture with SummaryMixing, both of which were first applied in speech recognition, along with a random quantization process. To facilitate reproducibility, we conduct pre-training on publicly available datasets, complemented by a proprietary dataset comparable in scale to other private datasets reported in the literature. We ensure robust evaluation by using a framework…
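The abstract does not spell out the random quantization process. As a rough illustration only, the sketch below shows one common form of random quantization used to build self-supervised training targets: a frozen random projection followed by a nearest-neighbour lookup in a frozen random codebook. All names here (`make_random_quantizer`, `quantize`, `input_dim`, `code_dim`, `codebook_size`) are hypothetical, and the dimensions are toy values; this is an assumption about the general technique, not the paper's implementation.

```python
import math
import random


def _l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def make_random_quantizer(input_dim, code_dim, codebook_size, seed=0):
    """Create a random projection matrix and a random codebook.

    Both are drawn once and never trained (assumption: frozen quantizer).
    """
    rng = random.Random(seed)
    projection = [
        [rng.gauss(0.0, 1.0) for _ in range(code_dim)] for _ in range(input_dim)
    ]
    codebook = [
        _l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
        for _ in range(codebook_size)
    ]
    return projection, codebook


def quantize(frames, projection, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    code_dim = len(projection[0])
    labels = []
    for frame in frames:
        # Project the frame into the codebook space.
        projected = [
            sum(frame[i] * projection[i][j] for i in range(len(frame)))
            for j in range(code_dim)
        ]
        projected = _l2_normalize(projected)
        # Pick the codebook entry with the smallest squared Euclidean distance.
        distances = [
            sum((p - c) ** 2 for p, c in zip(projected, code)) for code in codebook
        ]
        labels.append(distances.index(min(distances)))
    return labels


# Toy usage: 5 frames of 8 features each, quantized into 16 discrete labels.
_rng = random.Random(1)
example_frames = [[_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
example_projection, example_codebook = make_random_quantizer(8, 4, 16, seed=0)
example_labels = quantize(example_frames, example_projection, example_codebook)
```

In a setup of this kind, the resulting integer labels would serve as prediction targets for masked input frames, so the quantizer itself adds no trainable parameters.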

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Topic Modeling