Soft Clustering Anchors for Self-Supervised Speech Representation Learning in Joint Embedding Prediction Architectures

Georgios Ioannides; Adrian Kieback; Judah Goldfeder; Linsey Pang; Aman Chadha; Aaron Elkins; Yann LeCun; Ravid Shwartz-Ziv

arXiv:2602.09040·eess.AS·February 11, 2026

TL;DR

This paper introduces GMM-Anchored JEPA, a self-supervised speech representation learning method that fits a Gaussian Mixture Model once on log-mel spectrograms and uses its frozen soft posteriors as auxiliary targets. It improves ASR, emotion recognition, and slot filling over a WavLM-style baseline without iterative re-clustering.

Contribution

It proposes a GMM-based soft clustering approach for JEPA, eliminating the need for iterative re-clustering and enhancing speech representation quality.

Findings

01 Reduces ASR WER from 33.22% to 28.68%.

02 Raises emotion recognition accuracy from 65.46% to 67.76%.

03 Achieves up to 98% entropy in cluster utilization.
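
The listing does not define the cluster-utilization figure. One common reading, assumed here and not confirmed by the source, is the entropy of average cluster usage divided by its maximum value log K. The sketch below computes that quantity; the function name `normalized_usage_entropy` is hypothetical.

```python
import numpy as np

def normalized_usage_entropy(posteriors):
    """Entropy of mean cluster usage, scaled to [0, 1].

    posteriors: array of shape (T, K), each row a soft assignment over K > 1
    clusters. ASSUMPTION: this is one plausible definition of "entropy in
    cluster utilization"; the paper may define the metric differently.
    """
    usage = posteriors.mean(axis=0)       # average mass per cluster
    usage = usage / usage.sum()           # renormalize defensively
    nonzero = usage[usage > 0]            # treat 0 * log(0) as 0
    entropy = -(nonzero * np.log(nonzero)).sum()
    return float(entropy / np.log(posteriors.shape[1]))
```

Under this reading, a value near 1 means frames are spread almost evenly across clusters, and a value near 0 means nearly all frames fall into a single cluster.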

Abstract

Joint Embedding Predictive Architectures (JEPA) offer a promising approach to self-supervised speech representation learning, but suffer from representation collapse without explicit grounding. We propose GMM-Anchored JEPA, which fits a Gaussian Mixture Model once on log-mel spectrograms and uses its frozen soft posteriors as auxiliary targets throughout training. A decaying supervision schedule allows GMM regularization to dominate early training before gradually yielding to the JEPA objective. Unlike HuBERT and WavLM, which require iterative re-clustering, our approach clusters input features once with soft rather than hard assignments. On ~50k hours of speech, GMM anchoring improves ASR (28.68% vs. 33.22% WER), emotion recognition (67.76% vs. 65.46%), and slot filling (64.7% vs. 59.1% F1) compared to a WavLM-style baseline with matched compute. Cluster analysis shows GMM-anchored…
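
The mechanism in the abstract (frozen soft GMM posteriors as auxiliary targets, weighted by a decaying schedule) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption for illustration: the diagonal covariance, the linear decay shape, the soft cross-entropy form of the auxiliary loss, and the names `gmm_soft_posteriors`, `anchor_weight`, and `combined_loss`. The source does not state the exact loss or schedule.

```python
import numpy as np

def gmm_soft_posteriors(x, means, variances, weights):
    """Posterior p(k | x_t) under a diagonal-covariance GMM (assumed form).

    x: (T, D) frames, e.g. log-mel features; means, variances: (K, D);
    weights: (K,) mixture weights. The parameters are treated as frozen.
    """
    diff = x[:, None, :] - means[None, :, :]                     # (T, K, D)
    log_prob = -0.5 * np.sum(
        diff ** 2 / variances[None] + np.log(2 * np.pi * variances[None]),
        axis=-1,
    )                                                            # (T, K)
    log_joint = log_prob + np.log(weights)[None, :]
    log_joint -= log_joint.max(axis=1, keepdims=True)            # stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

def anchor_weight(step, total_steps, start=1.0, end=0.0):
    """Hypothetical linear decay of the GMM supervision weight."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def soft_cross_entropy(logits, targets):
    """Mean cross-entropy between soft targets and predicted logits."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())

def combined_loss(jepa_loss, logits, targets, step, total_steps):
    """JEPA loss plus the decaying GMM-anchor term (assumed additive form)."""
    lam = anchor_weight(step, total_steps)
    return jepa_loss + lam * soft_cross_entropy(logits, targets)
```

With this schedule the anchor term carries full weight at step 0 and vanishes at `total_steps`, so the total reduces to the JEPA loss alone by the end of training. That mirrors the described handover from GMM regularization to the JEPA objective, though the paper's actual schedule may differ.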

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Face recognition and analysis