Towards Early Prediction of Self-Supervised Speech Model Performance

Ryan Whetten; Lucas Maison; Titouan Parcollet; Marco Dinarelli; Yannick Estève

arXiv:2501.05966 · cs.SD · June 3, 2025

TL;DR

This paper introduces unsupervised metrics based on cluster quality and embedding rank that predict the downstream performance of SSL speech models during pre-training better than the loss does, reducing the GPU hours and labeled data needed for evaluation.

Contribution

It proposes novel unsupervised indicators for early evaluation of SSL speech models that outperform traditional loss-based measures.

Findings

1. Cluster quality correlates with downstream performance.

2. Embedding rank is a reliable predictor of model quality.

3. Methods require only one hour of unlabeled audio.

Abstract

In Self-Supervised Learning (SSL), pre-training and evaluation are resource-intensive. In the speech domain, current indicators of the quality of SSL models during pre-training, such as the loss, do not correlate well with downstream performance. Consequently, it is often difficult to gauge the final downstream performance in a cost-efficient manner during pre-training. In this work, we propose efficient unsupervised methods that give insight into the quality of the pre-training of SSL speech models, namely, measuring the cluster quality and the rank of the embeddings of the SSL model. Results show that, using only one hour of unlabeled audio, measures of cluster quality and rank correlate better with downstream performance than the pre-training loss does, reducing the need for GPU hours and labeled data in SSL model evaluation.
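
To make the two kinds of measures concrete, here is a minimal sketch in Python with NumPy. The abstract does not say which rank or cluster-quality measures the authors use, so both choices below are assumptions made for illustration: rank is approximated by the effective rank (the exponential of the entropy of the normalized singular values), and cluster quality by the Davies-Bouldin index computed on a plain k-means clustering. The function names `effective_rank`, `kmeans`, and `davies_bouldin` are hypothetical, and `embeddings` is assumed to be a (frames x dimensions) matrix extracted from a checkpoint on unlabeled audio.

```python
import numpy as np


def effective_rank(embeddings: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank: exp of the Shannon entropy of normalized singular values.

    Assumed stand-in for "rank of the embeddings"; ranges from 1 to
    min(n_frames, dim). Higher means the embeddings use more dimensions.
    """
    singular_values = np.linalg.svd(embeddings, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions to avoid log(0)
    return float(np.exp(-np.sum(p * np.log(p))))


def kmeans(x: np.ndarray, k: int, n_iter: int = 20, seed: int = 0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment so labels match the returned centroids.
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids


def davies_bouldin(x: np.ndarray, labels: np.ndarray, centroids: np.ndarray) -> float:
    """Davies-Bouldin index (assumed cluster-quality measure); lower is better."""
    k = len(centroids)
    # Mean distance of each cluster's members to its centroid.
    scatter = np.array([
        np.linalg.norm(x[labels == j] - centroids[j], axis=1).mean()
        if np.any(labels == j) else 0.0
        for j in range(k)
    ])
    centroid_dist = np.linalg.norm(
        centroids[:, None, :] - centroids[None, :, :], axis=2
    )
    np.fill_diagonal(centroid_dist, np.inf)  # exclude self-comparison
    ratios = (scatter[:, None] + scatter[None, :]) / centroid_dist
    return float(ratios.max(axis=1).mean())
```

In such a setup, one would compute both numbers on the same held-out hour of unlabeled audio at each saved checkpoint and then compare their trajectory against downstream scores, for example with a rank correlation. Neither measure needs labels or fine-tuning, which is where the savings described in the abstract would come from.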

Taxonomy

Topics: Speech Recognition and Synthesis