Multi-Class-Token Transformer for Multitask Self-supervised Music Information Retrieval

Yuexuan Kong; Vincent Lostanlen; Romain Hennequin; Mathieu Lagrange; Gabriel Meseguer-Brocal

arXiv:2507.12996 · cs.SD · July 18, 2025

TL;DR

This paper introduces a novel multi-class-token Vision Transformer architecture trained with multitask self-supervised learning to improve music information retrieval across various tasks, outperforming single-task models.

Contribution

The paper proposes a multi-class-token Vision Transformer (MT2) that combines contrastive and equivariant learning for multitask self-supervised music analysis, demonstrating superior performance and efficiency.

Findings

01. Outperforms single-task models on multiple MIR tasks.
02. Uses 18x fewer parameters than comparable models.
03. Demonstrates versatility across diverse music analysis tasks.

Abstract

Contrastive learning and equivariant learning are effective methods for self-supervised learning (SSL) for audio content analysis. Yet, their application to music information retrieval (MIR) faces a dilemma: the former is more effective on tagging (e.g., instrument recognition) but less effective on structured prediction (e.g., tonality estimation); the latter can match supervised methods on the specific task it is designed for, but it does not generalize well to other tasks. In this article, we adopt a best-of-both-worlds approach by training a deep neural network on both kinds of pretext tasks at once. The proposed new architecture is a Vision Transformer with 1-D spectrogram patches (ViT-1D), equipped with two class tokens, which are specialized to different self-supervised pretext tasks but optimized through the same model: hence the qualification of self-supervised…
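To make the input layout described in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of how a spectrogram might be cut into 1-D patches and prefixed with two class tokens. Several details are assumptions made for illustration: that each patch spans all frequency bins over a few consecutive time frames, that patches are linearly projected to the embedding size, and all names (`make_patches`, `project`, `build_sequence`, `cls_contrastive`, `cls_equivariant`) and sizes are hypothetical. The Transformer encoder and both pretext losses are omitted.

```python
import random


def make_patches(spec, frames_per_patch):
    """Cut a (freq x time) spectrogram into non-overlapping patches along time.

    Assumption: each 1-D patch covers every frequency bin for
    `frames_per_patch` consecutive frames, flattened into one vector.
    """
    n_freq, n_time = len(spec), len(spec[0])
    patches = []
    for start in range(0, n_time - frames_per_patch + 1, frames_per_patch):
        patch = [spec[f][t]
                 for t in range(start, start + frames_per_patch)
                 for f in range(n_freq)]
        patches.append(patch)
    return patches


def project(patch, weights):
    """Linear projection of a flattened patch to the embedding dimension."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]


def build_sequence(spec, frames_per_patch, weights, cls_a, cls_b):
    """Prepend two class tokens to the projected patch tokens."""
    tokens = [project(p, weights) for p in make_patches(spec, frames_per_patch)]
    return [cls_a, cls_b] + tokens


# Toy sizes, chosen only for illustration.
random.seed(0)
n_freq, n_time, frames_per_patch, embed_dim = 4, 6, 2, 3
spec = [[random.random() for _ in range(n_time)] for _ in range(n_freq)]
weights = [[random.gauss(0.0, 0.1) for _ in range(n_freq * frames_per_patch)]
           for _ in range(embed_dim)]

# In a trained model these would be learned vectors; fixed values stand in here.
cls_contrastive = [0.0] * embed_dim  # would be read out for the contrastive task
cls_equivariant = [1.0] * embed_dim  # would be read out for the equivariant task

sequence = build_sequence(spec, frames_per_patch, weights,
                          cls_contrastive, cls_equivariant)
print(len(sequence))  # 2 class tokens + 3 patch tokens
```

In such a layout, the whole sequence would pass through one shared encoder, and the outputs at the first two positions would each feed a different self-supervised objective, which matches the abstract's description of tokens that are specialized to different pretext tasks but optimized through the same model.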

Taxonomy

Topics: Music and Audio Processing