AudioMosaic: Contrastive Masked Audio Representation Learning

Hanxun Huang; Qizhou Wang; Xingjun Ma; Cihang Xie; Christopher Leckie; Sarah Erfani

arXiv:2605.14231 · cs.LG · May 15, 2026

TL;DR

AudioMosaic introduces a contrastive self-supervised learning method for audio that uses structured masking to efficiently learn transferable, discriminative representations, achieving state-of-the-art results across multiple audio benchmarks.

Contribution

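The paper presents a contrastive learning approach for audio that builds positive pairs by applying structured time-frequency masking to spectrogram patches. The masking lowers memory usage, which enables efficient large-batch training, and the resulting encoder generalizes better than generative methods.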

Findings

01. Achieves state-of-the-art results on standard audio benchmarks.

02. Learns more discriminative utterance-level representations than generative approaches.

03. Improves performance on audio-language tasks when integrated into models.

Abstract

Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre-training. We introduce AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that demonstrate strong transferability across datasets, domains, and…
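The pairing idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (the HanxunH/AudioMosaic repository is the reference for that). Everything specific here is an assumption made for illustration: the mask layout (dropping whole time columns and whole frequency rows of the patch grid), the mask ratios, the mean-pooling `toy_encode` stand-in for an encoder, and the symmetric InfoNCE loss with temperature 0.1. The sketch shows two things: two independently masked views of one clip form a positive pair, and only the visible patches are passed on, which is where the memory saving would come from.

```python
import numpy as np


def structured_keep_mask(n_time, n_freq, time_ratio, freq_ratio, rng):
    """Boolean (n_time, n_freq) grid over spectrogram patches; True = kept.

    Assumed layout: whole time columns and whole frequency rows are dropped.
    """
    keep = np.ones((n_time, n_freq), dtype=bool)
    n_t = int(round(time_ratio * n_time))
    n_f = int(round(freq_ratio * n_freq))
    keep[rng.choice(n_time, size=n_t, replace=False), :] = False
    keep[:, rng.choice(n_freq, size=n_f, replace=False)] = False
    return keep


def masked_view(patches, keep):
    """patches: (n_time, n_freq, dim). Returns visible patches only, (n_kept, dim)."""
    return patches[keep]


def toy_encode(visible, weight):
    """Hypothetical stand-in encoder: mean-pool visible patches, then project."""
    return visible.mean(axis=0) @ weight


def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE over a batch; z_a[i] and z_b[i] are a positive pair."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, n_time, n_freq, dim, out_dim = 4, 8, 4, 16, 8
    clips = rng.normal(size=(batch, n_time, n_freq, dim))  # fake patch embeddings
    weight = rng.normal(size=(dim, out_dim))

    views_a, views_b = [], []
    for clip in clips:
        # Two independent masks of the same clip give one positive pair.
        keep_a = structured_keep_mask(n_time, n_freq, 0.5, 0.5, rng)
        keep_b = structured_keep_mask(n_time, n_freq, 0.5, 0.5, rng)
        views_a.append(toy_encode(masked_view(clip, keep_a), weight))
        views_b.append(toy_encode(masked_view(clip, keep_b), weight))

    loss = info_nce(np.stack(views_a), np.stack(views_b))
    print("contrastive loss:", loss)
```

With the assumed ratios of 0.5 along each axis, each view keeps 8 of the 32 patches, so the encoder in this sketch processes a quarter of the tokens per view. Ratios that drop every row or column would leave no visible patches, so a real pipeline would need to guard against that case.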

Code & Models

Repositories

HanxunH/AudioMosaic (GitHub)
