Investigating the 'Autoencoder Behavior' in Speech Self-Supervised Models: a focus on HuBERT's Pretraining

Valentin Vielzeuf

arXiv:2405.08402·cs.CL·May 15, 2024

Open Access

TL;DR

This paper investigates the autoencoder-like behavior in HuBERT speech models during pretraining, aiming to understand and improve the high-level feature learning for better speech recognition performance.

Contribution

It provides an analysis of HuBERT's pretraining dynamics and proposes training modifications to enhance high-level feature extraction and downstream task performance.

Findings

01 Improved training procedures lead to faster convergence.

02 Enhanced HuBERT models achieve competitive results on speech recognition tasks.

03 Analysis reveals less autoencoder behavior in HuBERT compared to other models.

Abstract

Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that finetuning all layers of the learned model leads to lower performance compared to resetting top layers. This phenomenon is attributed to the "autoencoder" behavior: top layers contain information closer to the input and are less suitable for tasks that require linguistic information, such as Speech Recognition. To better our understanding of this behavior, we propose to study the evolution of high-level information within the model during pretraining. We focus on the HuBERT model, which exhibits a less pronounced "autoencoder" behavior. By experimentally exploring various factors that may have an impact, we aim to improve the training procedure and enhance the top layers of HuBERT for high-level tasks. Furthermore, our experiments demonstrate that these improvements in the…

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Focus