Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models
Li-Wei Chen, Takuya Higuchi, He Bai, Ahmed Hussen Abdelaziz, Alexander, Rudnicky, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald, Zakaria, Aldeneh

TL;DR
This paper investigates how different prediction targets in masked pre-training influence speech foundation models' performance, proposing new approaches to improve target design and downstream task results.
Contribution
It systematically studies the impact of prediction target design choices in speech models and introduces methods to create more informative targets for better downstream performance.
Findings
Design choices for HuBERT can be suboptimal.
Proposed approaches improve downstream task performance.
Enhanced prediction targets benefit various speech tasks.
Abstract
Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech data and then used for a range of downstream tasks. These models use a masked prediction objective, where the model learns to predict information about masked input segments from the unmasked context. The choice of prediction targets in this framework impacts their performance on downstream tasks. For instance, models pre-trained with targets that capture prosody learn representations suited for speaker-related tasks, while those pre-trained with targets that capture phonetics learn representations suited for content-related tasks. Moreover, prediction targets can differ in the level of detail they capture. Models pre-trained with targets that encode fine-grained acoustic features perform better on tasks like denoising, while those pre-trained with targets focused on…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Natural Language Processing Techniques
