Investigating the 'Autoencoder Behavior' in Speech Self-Supervised Models: a focus on HuBERT's Pretraining
Valentin Vielzeuf

TL;DR
This paper investigates the autoencoder-like behavior of HuBERT speech models during pretraining, aiming to understand and improve high-level feature learning for better speech recognition performance.
Contribution
It provides an analysis of HuBERT's pretraining dynamics and proposes training modifications to enhance high-level feature extraction and downstream task performance.
Findings
Improved training procedures lead to faster convergence.
Enhanced HuBERT models achieve competitive results on speech recognition tasks.
Analysis reveals less autoencoder behavior in HuBERT compared to other models.
Abstract
Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that finetuning all layers of the learned model leads to lower performance compared to resetting top layers. This phenomenon is attributed to the "autoencoder" behavior: top layers contain information closer to the input and are less suitable for tasks that require linguistic information, such as Speech Recognition. To better our understanding of this behavior, we propose to study the evolution of high-level information within the model during pretraining. We focus on the HuBERT model, which exhibits a less pronounced "autoencoder" behavior. By experimentally exploring various factors that may have an impact, we aim to improve the training procedure and enhance the top layers of HuBERT for high-level tasks. Furthermore, our experiments demonstrate that these improvements in the…
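The abstract mentions resetting top layers before finetuning. A minimal sketch of what that could look like in PyTorch follows. It is not the paper's procedure: the encoder is a toy stand-in for a pretrained model, and the helper name `reset_top_layers`, the layer count, and the layer sizes are all assumptions made for illustration.

```python
import copy

import torch
from torch import nn


def reset_top_layers(layers: nn.ModuleList, num_reset: int) -> None:
    """Re-initialize the last `num_reset` layers in place (hypothetical helper)."""
    if num_reset <= 0:
        return
    for layer in list(layers)[-num_reset:]:
        for module in layer.modules():
            # Standard PyTorch layers such as nn.Linear and nn.LayerNorm
            # expose reset_parameters(); modules without it are skipped.
            if hasattr(module, "reset_parameters"):
                module.reset_parameters()


torch.manual_seed(0)

# Toy stand-in for a pretrained encoder stack (assumption: 6 small blocks).
encoder = nn.ModuleList(
    [nn.Sequential(nn.Linear(16, 16), nn.GELU(), nn.LayerNorm(16)) for _ in range(6)]
)
pretrained = copy.deepcopy(encoder)  # snapshot of the "pretrained" weights

# Keep the 4 lower layers as they are and re-initialize the 2 top layers.
reset_top_layers(encoder, num_reset=2)
```

With a real pretrained checkpoint, the same idea would be applied to the model's own list of transformer layers before finetuning starts. How many layers to reset is a hyperparameter that this sketch does not settle.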
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Focus
