Iterative refinement, not training objective, makes HuBERT behave differently from wav2vec 2.0
Robin Huo, Ewan Dunbar

TL;DR
This paper investigates how iterative pseudo-label refinement, rather than the training objective, shapes the linguistic information encoded by the self-supervised speech models HuBERT and wav2vec 2.0.
Contribution
The paper shows that differences between the two models in encoded linguistic information are explained primarily by training iteration, not by the training objective, which points to iterative refinement as the important factor.
Findings
Canonical correlation of hidden representations with word identity, phoneme identity, and speaker identity depends on training iteration.
Training objective has less impact on linguistic information than iterative refinement.
Future research should explore why iterative refinement enhances linguistic encoding.
Abstract
Self-supervised models for speech representation learning now see widespread use for their versatility and performance on downstream tasks, but the effect of model architecture on the linguistic information learned in their representations remains under-studied. This study investigates two such models, HuBERT and wav2vec 2.0, and minimally compares two of their architectural differences: training objective and iterative pseudo-label refinement through multiple training iterations. We find that differences in canonical correlation of hidden representations to word identity, phoneme identity, and speaker identity are explained by training iteration, not training objective. We suggest that future work investigate the reason for the effectiveness of iterative refinement in encoding linguistic information in self-supervised speech representations.
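The abstract measures how well hidden representations align with word, phoneme, and speaker identity using canonical correlation. As a rough illustration of that kind of measurement, the sketch below computes a plain mean canonical correlation between a matrix of frame-level representations and one-hot labels. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `mean_cca`, the toy data, and the choice of an unweighted mean are all assumptions for illustration, and the paper may use a different CCA variant.

```python
import numpy as np

def mean_cca(X, Y, eps=1e-8):
    """Mean canonical correlation between two views.

    X: (n_samples, dx) array, e.g. hidden representations (hypothetical input).
    Y: (n_samples, dy) array, e.g. one-hot labels for phoneme identity.
    """
    # Center each view so correlations are not driven by the means.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces, dropping near-null directions
    # (a centered one-hot matrix always has one redundant column).
    Ux, sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, sy, _ = np.linalg.svd(Y, full_matrices=False)
    Ux = Ux[:, sx > eps * sx.max()]
    Uy = Uy[:, sy > eps * sy.max()]
    # Canonical correlations are the singular values of Ux^T Uy.
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())

# Toy data (assumption): 200 frames, 3 label classes, 8-dimensional features.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=200)
onehot = np.eye(3)[labels]

# Features that linearly encode the label, plus a little noise.
informative = onehot @ rng.normal(size=(3, 8)) + 0.01 * rng.normal(size=(200, 8))
# Features that are unrelated to the label.
unrelated = rng.normal(size=(200, 8))

score_informative = mean_cca(informative, onehot)
score_unrelated = mean_cca(unrelated, onehot)
print(f"informative: {score_informative:.3f}, unrelated: {score_unrelated:.3f}")
```

In a comparison like the one described in the abstract, a score of this kind would be computed for each model and layer against each label type, so that differences can be attributed to training iteration or training objective.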
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems
