Learning Speech Representations with Variational Predictive Coding

Sung-Lin Yeh; Peter Bell; Hao Tang

arXiv:2601.00100·eess.AS·January 5, 2026

TL;DR

This paper shows that predictive coding under a variational view is the principle behind the HuBERT objective. The formulation motivates two simple modifications to that objective, and the resulting gains in pre-training carry over to four downstream speech tasks.

Contribution

It introduces a variational predictive coding framework for speech representation learning, providing a theoretical basis and practical modifications to improve HuBERT and related objectives.

Findings

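01 Two simple modifications bring immediate improvements to the HuBERT objective

02 Better pre-training brings significant improvements to four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition

03 The predictive coding formulation has tight connections to other objectives, such as APC, CPC, wav2vec, and BEST-RQ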

Abstract

Despite being the best known objective for learning speech representations, the HuBERT objective has not been further developed and improved. We argue that it is the lack of an underlying principle that stalls the development, and, in this paper, we show that predictive coding under a variational view is the principle behind the HuBERT objective. Due to its generality, our formulation provides opportunities to improve parameterization and optimization, and we show two simple modifications that bring immediate improvements to the HuBERT objective. In addition, the predictive coding formulation has tight connections to various other objectives, such as APC, CPC, wav2vec, and BEST-RQ. Empirically, the improvement in pre-training brings significant improvements to four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition, highlighting…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis