TL;DR
AudioMosaic introduces a contrastive self-supervised learning method for audio that uses structured masking to efficiently learn transferable, discriminative representations, achieving state-of-the-art results across multiple audio benchmarks.
Contribution
It presents a novel contrastive learning approach with structured masking for audio, enabling efficient large-batch training and improved generalization over generative methods.
Findings
Achieves state-of-the-art results on standard audio benchmarks.
Learns more discriminative utterance-level representations.
Improves performance on audio-language tasks when integrated into models.
Abstract
Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre-training. We introduce AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that demonstrate strong transferability across datasets, domains, and…
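The pre-training idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the band-shaped mask, the mean-pooling stand-in for the encoder, the InfoNCE loss, and all names (`structured_mask`, `encode_visible`, `info_nce`) and sizes are assumptions chosen for illustration. The sketch shows two independently masked views of the same clip forming a positive pair, with only the visible patches passed on, which is one plausible reason masking would reduce memory use.

```python
import numpy as np


def structured_mask(n_freq, n_time, band_f, band_t, rng):
    """Boolean keep-mask over a (n_freq, n_time) patch grid.

    Hypothetical masking scheme: hide one contiguous frequency band
    and one contiguous time band (the paper's exact structure may differ).
    """
    keep = np.ones((n_freq, n_time), dtype=bool)
    f0 = rng.integers(0, n_freq - band_f + 1)  # start of masked frequency band
    t0 = rng.integers(0, n_time - band_t + 1)  # start of masked time band
    keep[f0:f0 + band_f, :] = False
    keep[:, t0:t0 + band_t] = False
    return keep


def encode_visible(patches, keep):
    """Stand-in encoder: mean-pool the visible patch embeddings.

    patches has shape (n_freq, n_time, dim); only patches[keep] is used,
    so the cost scales with the number of visible patches.
    """
    return patches[keep].mean(axis=0)


def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss where row i of z1 and row i of z2 are a positive pair."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Assumed toy batch: 4 clips, 8x64 patch grid, 16-dim patch embeddings.
    batch = rng.normal(size=(4, 8, 64, 16))
    view_a = np.stack([encode_visible(x, structured_mask(8, 64, 2, 16, rng)) for x in batch])
    view_b = np.stack([encode_visible(x, structured_mask(8, 64, 2, 16, rng)) for x in batch])
    print("contrastive loss:", info_nce(view_a, view_b))
```

With these assumed sizes each view keeps 288 of 512 patches, so a real encoder would process roughly 56% of the tokens per view.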
Code & Models
- 🤗 hanxunh/AudioMosaic-vit-b16-pretrained (model, 88 downloads, 1 like)
- 🤗 hanxunh/AudioMosaic-vit-b16-finetune-as20k (model, 93 downloads)
- 🤗 hanxunh/AudioMosaic-vit-b16-finetune-as2m (model, 79 downloads)
- 🤗 hanxunh/AudioMosaic-vit-b16-finetune-spc1 (model, 73 downloads)
- 🤗 hanxunh/AudioMosaic-vit-b16-finetune-spc2 (model, 78 downloads)
- 🤗 hanxunh/AudioMosaic-vit-b16-finetune-esc-split1 (model, 75 downloads)
- 🤗 hanxunh/AudioMosaic-vit-b16-finetune-esc-split2 (model, 86 downloads)
- 🤗 hanxunh/AudioMosaic-vit-b16-finetune-esc-split3 (model, 78 downloads)
- 🤗 hanxunh/AudioMosaic-vit-b16-finetune-esc-split4 (model, 65 downloads)
- 🤗 hanxunh/AudioMosaic-vit-b16-finetune-esc-split5 (model, 71 downloads)
