Learning Speech Representations with Variational Predictive Coding
Sung-Lin Yeh, Peter Bell, Hao Tang

TL;DR
This paper reveals that the HuBERT speech representation learning objective is based on variational predictive coding, offering a unifying principle that enables simple improvements and enhances performance across multiple speech tasks.
Contribution
It introduces a variational predictive coding framework for speech representation learning, providing a theoretical basis and practical modifications to improve HuBERT and related objectives.
Findings
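Two simple modifications bring immediate improvements to the HuBERT objective
Better pre-training yields significant improvements on four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition
The predictive coding formulation connects closely to other objectives, such as APC, CPC, wav2vec, and BEST-RQ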
Abstract
Despite being the best known objective for learning speech representations, the HuBERT objective has not been further developed and improved. We argue that it is the lack of an underlying principle that stalls the development, and, in this paper, we show that predictive coding under a variational view is the principle behind the HuBERT objective. Due to its generality, our formulation provides opportunities to improve parameterization and optimization, and we show two simple modifications that bring immediate improvements to the HuBERT objective. In addition, the predictive coding formulation has tight connections to various other objectives, such as APC, CPC, wav2vec, and BEST-RQ. Empirically, the improvement in pre-training brings significant improvements to four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition, highlighting…
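As background for the objective the abstract refers to: HuBERT-style training is commonly described as masked prediction, where a model sees speech frames with some positions masked and is trained to predict a discrete unit (for example, a k-means cluster id) at each masked position using cross-entropy. The sketch below is a simplified illustration of that loss only. It is not the paper's variational formulation or either of its two modifications, and the function name, array shapes, and use of plain logits are assumptions made for the example.

```python
import numpy as np

def masked_prediction_loss(logits, targets, mask):
    """Mean cross-entropy over masked frames only (hypothetical helper).

    logits:  (T, K) unnormalized scores over K discrete units per frame
    targets: (T,)   integer unit ids, e.g. k-means cluster assignments
    mask:    (T,)   boolean, True where the input frame was masked
    """
    # Numerically stable log-softmax over the K units.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the target unit at every frame.
    nll = -log_probs[np.arange(len(targets)), targets]
    # Average only over the masked positions.
    return nll[mask].mean()

# Toy example with random scores; values here are illustrative, not results.
rng = np.random.default_rng(0)
T, K = 6, 4
logits = rng.normal(size=(T, K))
targets = rng.integers(0, K, size=T)
mask = np.array([False, True, True, False, True, False])
loss = masked_prediction_loss(logits, targets, mask)
```

With all-zero logits the model is uniform over the K units, so the loss equals log K; this is a convenient sanity check for an implementation of this kind.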
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis
