JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention

Georgios Ioannides; Christos Constantinou; Aman Chadha; Aaron Elkins; Linsey Pang; Ravid Shwartz-Ziv; Yann LeCun

arXiv:2512.07168·cs.SD·December 9, 2025

TL;DR

This paper presents a novel self-supervised framework combining JEPA and DAAM to learn robust, hierarchical speech representations that enable efficient tokenization and high-quality waveform reconstruction, advancing neural audio coding.

Contribution

It introduces a density-adaptive attention mechanism into JEPA for adaptive feature selection and hierarchical speech structure discovery at low frame rates.
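To make the idea of density-adaptive gating concrete, here is a minimal sketch, not the paper's implementation. It assumes a single 1-D feature sequence over time and a mixture of K Gaussians whose centers are offsets from the sequence mean. The function name `gaussian_mixture_gate` and the parameters `offsets`, `scales`, and `eps` are hypothetical, chosen only for illustration. In a trained model the offsets and scales would be learned.

```python
import math


def gaussian_mixture_gate(x, offsets, scales, eps=1e-5):
    """Reweight a 1-D feature sequence by a Gaussian mixture built from its own statistics.

    Assumption: each of the K components is centered at (mean + offset) and has
    width (std * scale). This is an illustrative gate, not the published DAAM.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)

    gated = []
    for v in x:
        # Average of K unnormalized Gaussian bumps, so the weight lies in (0, 1].
        weight = 0.0
        for offset, scale in zip(offsets, scales):
            y = (v - (mean + offset)) / (std * scale)
            weight += math.exp(-0.5 * y * y)
        weight /= len(offsets)
        gated.append(v * weight)
    return gated


if __name__ == "__main__":
    frames = [1.0, 1.0, 1.0, 5.0]
    # One component centered on the mean: the outlier frame is attenuated most.
    print(gaussian_mixture_gate(frames, offsets=[0.0], scales=[1.0]))
```

With a single component centered on the mean, values near the mean pass almost unchanged while values far from it are suppressed. Adding components with nonzero offsets lets the gate favor several regions of the feature distribution at once.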

Findings

01 Tokens are highly compressed and reversible.

02 The model achieves competitive audio coding performance.

03 Hierarchical speech structure is effectively captured.

Abstract

We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5 Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly…
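The tokenization step named in the abstract (FSQ followed by mixed-radix packing) can be illustrated with a small sketch. This is a generic version of both techniques under stated assumptions, not the authors' code: the level counts `[5, 5, 5]` are made up for the example, only odd level counts are handled, and the function names are hypothetical. As plain arithmetic, 47.5 tokens/sec at 2.5 Hz corresponds to 19 tokens per frame; how the paper arrives at that grouping is not stated in the text above.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each scalar to an integer digit in 0..levels[i]-1 (odd levels only)."""
    digits = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (n_levels - 1) // 2
        bounded = half * math.tanh(value)          # squash into (-half, half)
        digits.append(int(round(bounded)) + half)  # round, then shift to 0..n_levels-1
    return digits


def pack_mixed_radix(digits, levels):
    """Pack digits into one integer; digits[0] is the least significant digit."""
    index = 0
    for digit, n_levels in zip(reversed(digits), reversed(levels)):
        index = index * n_levels + digit
    return index


def unpack_mixed_radix(index, levels):
    """Invert pack_mixed_radix, recovering the per-dimension digits."""
    digits = []
    for n_levels in levels:
        index, digit = divmod(index, n_levels)
        digits.append(digit)
    return digits


if __name__ == "__main__":
    levels = [5, 5, 5]                 # hypothetical level counts
    digits = fsq_quantize([0.0, 10.0, -10.0], levels)
    token = pack_mixed_radix(digits, levels)
    print(digits, token, unpack_mixed_radix(token, levels))
```

Because every digit stays below its radix, packing is lossless: unpacking a token returns exactly the digits that produced it, which is what makes the token stream reversible at the quantized level. The quantization itself is still lossy with respect to the continuous latent.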

Code & Models

Datasets

Kylan12/Synthetic-AI-ML-Dataset
dataset · 42 dl

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing