Audio-JEPA: Joint-Embedding Predictive Architecture for Audio Representation Learning
Ludovic Tuncay (IRIT-SAMoVA), Etienne Labbé (IRIT-SAMoVA), Emmanouil Benetos (QMUL), Thomas Pellegrini (IRIT-SAMoVA)

TL;DR
Audio-JEPA introduces a self-supervised learning framework for audio that uses a Vision Transformer to predict latent representations of masked spectrogram patches, achieving results competitive with existing models while using less training data.
Contribution
It adapts the JEPA paradigm to audio data, demonstrating effective self-supervised pre-training on spectrograms with minimal hyper-parameter tuning.
Findings
Comparable performance to wav2vec 2.0 and data2vec
Uses less than one-fifth of the training data
No hyper-parameter tuning required
Abstract
Building on the Joint-Embedding Predictive Architecture (JEPA) paradigm, a recent self-supervised learning framework that predicts latent representations of masked regions in high-level feature spaces, we propose Audio-JEPA (Audio Joint-Embedding Predictive Architecture), tailored specifically for audio data. Audio-JEPA uses a simple Vision Transformer backbone to predict latent representations of masked spectrogram patches rather than reconstructing raw audio. We pre-train on unlabeled AudioSet clips (10 s, 32 kHz) with random patch masking on mel-spectrograms. We evaluate on the X-ARES suite covering speech, music, and environmental sound tasks. Although our implementation is a straightforward translation of the original model to audio, the results still show comparable performance to wav2vec 2.0 and data2vec while using less than one-fifth of their training data and with no…
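The random patch masking mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the 16x16 patch size, the 50% mask ratio, the 128 mel bins, and the helper names `patchify` and `random_patch_mask` are all assumptions chosen for illustration. The sketch covers only the masking step; in a JEPA setup the prediction targets would be latent representations of the masked patches, not the raw patches shown here.

```python
import numpy as np

def patchify(spec, patch_h=16, patch_w=16):
    """Split a (n_mels, n_frames) spectrogram into non-overlapping patches.

    Returns an array of shape (num_patches, patch_h * patch_w).
    Patch size is an illustrative assumption, not a value from the paper.
    """
    n_mels, n_frames = spec.shape
    # Crop so that both dimensions divide evenly into patches.
    n_mels -= n_mels % patch_h
    n_frames -= n_frames % patch_w
    spec = spec[:n_mels, :n_frames]
    grid_h, grid_w = n_mels // patch_h, n_frames // patch_w
    patches = spec.reshape(grid_h, patch_h, grid_w, patch_w).transpose(0, 2, 1, 3)
    return patches.reshape(grid_h * grid_w, patch_h * patch_w)

def random_patch_mask(num_patches, mask_ratio=0.5, rng=None):
    """Return a boolean vector in which True marks a masked patch.

    The mask ratio is an illustrative assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
# Random stand-in for a log-mel spectrogram (128 mel bins, 1000 frames).
spec = rng.standard_normal((128, 1000))
patches = patchify(spec)
mask = random_patch_mask(len(patches), mask_ratio=0.5, rng=rng)

context_patches = patches[~mask]  # visible patches, fed to the context encoder
masked_patches = patches[mask]    # regions whose latents would be predicted
```

With these assumed sizes, the 1000 frames are cropped to 992, giving an 8 x 62 grid of 496 patches, half of which are masked.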
