Audio-JEPA: Joint-Embedding Predictive Architecture for Audio Representation Learning

Ludovic Tuncay (IRIT-SAMoVA); Etienne Labbé (IRIT-SAMoVA); Emmanouil Benetos (QMUL); Thomas Pellegrini (IRIT-SAMoVA)

arXiv:2507.02915 · cs.SD · July 8, 2025

TL;DR

Audio-JEPA is a self-supervised learning framework for audio in which a Vision Transformer predicts latent representations of masked mel-spectrogram patches rather than reconstructing the input. It reaches performance comparable to wav2vec 2.0 and data2vec while using less than one-fifth of their training data.

Contribution

It adapts the JEPA paradigm to audio data, demonstrating effective self-supervised pre-training on spectrograms with minimal hyper-parameter tuning.

Findings

01 Performance comparable to wav2vec 2.0 and data2vec on the X-ARES suite (speech, music, and environmental sound tasks)

02 Uses less than one-fifth of the training data of those baselines

03 Results obtained with no hyper-parameter tuning

Abstract

Building on the Joint-Embedding Predictive Architecture (JEPA) paradigm, a recent self-supervised learning framework that predicts latent representations of masked regions in high-level feature spaces, we propose Audio-JEPA (Audio Joint-Embedding Predictive Architecture), tailored specifically for audio data. Audio-JEPA uses a simple Vision Transformer backbone to predict latent representations of masked spectrogram patches rather than reconstructing raw audio. We pre-train on unlabeled AudioSet clips (10s, 32kHz) with random patch masking on mel-spectrograms. We evaluate on the X-ARES suite covering speech, music, and environmental sound tasks. Although our implementation is a straightforward translation of the original model to audio, the results still show comparable performance to wav2vec 2.0 and data2vec while using less than one-fifth of their training data and with no…
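The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the patch size (16), the spectrogram shape (128 mel bins by 256 frames), the mask ratio (0.5), and the squared-error loss on latents are all assumptions chosen for illustration, and the function names are hypothetical. The sketch only shows how a mel-spectrogram patch grid might be split into visible context patches and masked target patches, and how a loss could be computed in latent space rather than on raw audio.

```python
import random


def patch_grid(n_mels, n_frames, patch):
    """Return (rows, cols) of non-overlapping patch x patch tiles."""
    return n_mels // patch, n_frames // patch


def random_patch_mask(n_patches, mask_ratio, seed=None):
    """Split patch indices into (context, target) by uniform random masking."""
    rng = random.Random(seed)
    idx = list(range(n_patches))
    rng.shuffle(idx)
    n_masked = int(round(n_patches * mask_ratio))
    target = sorted(idx[:n_masked])    # masked patches: latents to predict
    context = sorted(idx[n_masked:])   # visible patches: fed to the encoder
    return context, target


def latent_l2_loss(predicted, target):
    """Mean squared error between predicted and target latent vectors.

    Assumption: the source does not state the loss; squared error is used
    here only to illustrate a loss computed on latents.
    """
    total, count = 0.0, 0
    for p_vec, t_vec in zip(predicted, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


# Hypothetical sizes, not taken from the paper.
rows, cols = patch_grid(n_mels=128, n_frames=256, patch=16)
context_idx, target_idx = random_patch_mask(rows * cols, mask_ratio=0.5, seed=0)
```

In a full JEPA-style setup, a context encoder would embed the patches in `context_idx`, a predictor would output one latent per index in `target_idx`, and those predictions would be compared against latents produced by a separate target encoder; those components are omitted here.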
