SLAP: Siamese Language-Audio Pretraining Without Negative Samples for Music Understanding

Julien Guinot; Alain Riou; Elio Quinton; György Fazekas

arXiv:2506.17815 · cs.SD · June 24, 2025

TL;DR

SLAP introduces a scalable, negative-sample-free multimodal pretraining framework for music understanding, improving retrieval, classification, and robustness while reducing memory requirements and enabling large-scale training on limited hardware.

Contribution

SLAP adapts Bootstrap Your Own Latent (BYOL) to multimodal audio-text pretraining, addressing the modality gap and the scalability limits of contrastive music understanding models.

Findings

1. Outperforms CLAP on text-music retrieval and zero-shot classification.

2. Achieves performance on music information retrieval (MIR) tasks that is competitive with larger or supervised models.

3. Reduces the modality gap and improves robustness to batch size variations.
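
To make the modality gap in the last finding concrete, the sketch below computes one common proxy for it: the Euclidean distance between the centroids of the L2-normalized audio embeddings and the L2-normalized text embeddings. This page does not say which measure the paper reports, so the choice of metric and the function names (`l2_normalize`, `centroid`, `modality_gap`) are assumptions made for illustration only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def centroid(embeddings):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]


def modality_gap(audio_embeddings, text_embeddings):
    """Distance between the centroids of the two normalized modalities.

    Hypothetical proxy metric: 0 means the two modalities share a centroid,
    larger values mean they occupy more separated regions of the space.
    """
    audio_center = centroid([l2_normalize(e) for e in audio_embeddings])
    text_center = centroid([l2_normalize(e) for e in text_embeddings])
    return math.sqrt(
        sum((a - t) ** 2 for a, t in zip(audio_center, text_center))
    )
```

A value near zero means the audio and text embeddings are centered on the same region. It does not show on its own that matching pairs are well aligned.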

Abstract

Joint embedding spaces have significantly advanced music understanding and generation by linking text and audio through multimodal contrastive learning. However, these approaches have high memory requirements because they rely on large batch sizes to make effective use of negative samples. Further, multimodal joint embedding spaces suffer from a modality gap, wherein embeddings from different modalities lie in different manifolds of the embedding space. To address these challenges, we propose Siamese Language-Audio Pretraining (SLAP), a novel multimodal pretraining framework that allows learning powerful representations without negative samples. SLAP adapts the Bootstrap Your Own Latent (BYOL) paradigm for multimodal audio-text training, promoting scalability in training multimodal embedding spaces. We illustrate the ability of our model to learn meaningful relationships between…
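
As a rough illustration of how a BYOL-style objective works without negative samples, the sketch below shows the two standard BYOL ingredients: a loss that pulls an online network's prediction toward a target network's projection of the paired input, and an exponential moving average (EMA) update of the target parameters. It is a minimal sketch of generic BYOL applied across two modalities, not SLAP's actual implementation. The symmetric audio-to-text and text-to-audio pairing, the function names, and the `tau` default are assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def byol_loss(prediction, target):
    """Generic BYOL regression loss: 2 - 2 * cosine similarity.

    Only the positive pair appears, so no other batch items are needed.
    In training, gradients would not flow through `target`.
    """
    p = l2_normalize(prediction)
    z = l2_normalize(target)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def cross_modal_loss(audio_pred, text_target, text_pred, audio_target):
    """Assumed symmetric pairing: each modality predicts the other's target."""
    return byol_loss(audio_pred, text_target) + byol_loss(text_pred, audio_target)


def ema_update(target_params, online_params, tau=0.99):
    """Move target parameters slowly toward the online parameters."""
    return [
        tau * t + (1.0 - tau) * o
        for t, o in zip(target_params, online_params)
    ]
```

Because the loss for each pair depends only on that pair, its cost does not grow with the number of other items in the batch. In a contrastive objective, by contrast, every other item serves as a negative.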

Code & Models

Repositories

pliploop/slap
PyTorch · Official
