SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training

Xinhao Mei; Gael Le Lan; Haohe Liu; Zhaoheng Ni; Varun Nagaraja; Yang Liu; Yangyang Shi; Vikas Chandra

arXiv:2601.12594·eess.AS·January 21, 2026

TL;DR

SLAP introduces a scalable, multi-objective pretraining approach for language-audio models, handling large datasets and variable audio durations to improve dense audio representations and performance on retrieval and classification tasks.

Contribution

SLAP is a pretraining method that scales to 109 million audio-text pairs, combines multiple training objectives, and supports variable-duration audio to improve audio representation learning.

Findings

01. Achieves state-of-the-art results on audio-text retrieval.

02. Improves zero-shot audio classification performance.

03. Handles variable-duration audio effectively.

Abstract

Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for various audio-related tasks. However, current CLAP models face several key limitations. First, they are typically trained on relatively small datasets, often comprising a few million audio samples. Second, existing CLAP models are restricted to short, fixed-duration audio, which constrains their use in real-world scenarios with variable-duration audio. Third, the standard contrastive training objective operates on global representations, which may hinder the learning of dense, fine-grained audio features. To address these challenges, we introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations and incorporates multiple training objectives.…
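
For readers unfamiliar with the "standard contrastive training objective" that the abstract refers to, the following is a minimal sketch of the symmetric contrastive loss commonly used in CLAP-style training. It is not SLAP's own multi-objective loss, which this page does not describe. The function names, the temperature value of 0.07, and the use of NumPy are assumptions made for illustration. Each clip and each caption is reduced to one global embedding vector, which is the property the abstract identifies as a limitation for dense, fine-grained features.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch of paired global embeddings.

    audio_emb, text_emb: arrays of shape (batch, dim); row i of each
    is assumed to come from the same audio-text pair.
    temperature: illustrative value (an assumption, not from the paper).
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])          # matching pairs lie on the diagonal
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # audio -> text
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 16))
    text = audio + 0.01 * rng.normal(size=(8, 16))  # nearly aligned pairs
    print("aligned pairs:   ", symmetric_contrastive_loss(audio, text))
    print("misaligned pairs:", symmetric_contrastive_loss(audio, np.roll(text, 1, axis=0)))
```

Because the loss only compares one vector per clip against one vector per caption, nothing in it supervises frame-level or local audio features directly.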

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing