Efficient Training of Self-Supervised Speech Foundation Models on a Compute Budget
Andy T. Liu, Yi-Cheng Lin, Haibin Wu, Stefan Winkler, Hung-yi Lee

TL;DR
This paper explores how to efficiently train speech foundation models with self-supervised learning within limited compute resources, emphasizing the importance of model architecture, data size, and their trade-offs.
Contribution
The paper takes analytical steps toward understanding the training dynamics of speech foundation models and benchmarks SSL objectives in a fully comparable setting, finding that model architecture and data size influence performance under compute constraints more than the choice of objective does.
Findings
Slimmer architectures outperform common small architectures under the same compute and parameter budget.
Pre-training data size remains critical despite data augmentation.
An optimal model size exists for a given compute budget, balancing model and data size.
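The trade-off in the last finding can be illustrated with a toy calculation. This is a minimal sketch, not the paper's method: it assumes training compute is roughly 6 × parameters × frames processed (a common rule of thumb for transformer training), and the budget, dataset size, and function names are hypothetical values chosen only for illustration.

```python
# Toy illustration of the model-size vs. data trade-off under a fixed budget.
# ASSUMPTION (not from the paper): compute ~= 6 * n_params * frames_processed.

def frames_affordable(compute_flops, n_params, flops_per_param_per_frame=6.0):
    """Number of input frames that fit in the budget for a given model size."""
    return compute_flops / (flops_per_param_per_frame * n_params)


def passes_over_dataset(compute_flops, n_params, dataset_frames):
    """How many times the budget lets the model iterate over a fixed dataset."""
    return frames_affordable(compute_flops, n_params) / dataset_frames


# Hypothetical numbers, for illustration only.
budget_flops = 1e18
dataset_frames = 1e8

for n_params in (1e7, 3e7, 1e8, 3e8):
    frames = frames_affordable(budget_flops, n_params)
    passes = passes_over_dataset(budget_flops, n_params, dataset_frames)
    print(f"params={n_params:.0e}  frames={frames:.2e}  passes={passes:.1f}")
```

Under this assumption, doubling the parameter count halves the number of frames the same budget can cover, so a larger model sees less data while a smaller model makes more passes over the same data. Where the best balance lies is an empirical question that the sketch does not answer.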
Abstract
Despite their impressive success, training foundation models remains computationally costly. This paper investigates how to efficiently train speech foundation models with self-supervised learning (SSL) under a limited compute budget. We examine critical factors in SSL that impact the budget, including model architecture, model size, and data size. Our goal is to make analytical steps toward understanding the training dynamics of speech foundation models. We benchmark SSL objectives in an entirely comparable setting and find that other factors contribute more significantly to the success of SSL. Our results show that slimmer model architectures outperform common small architectures under the same compute and parameter budget. We demonstrate that the size of the pre-training data remains crucial, even with data augmentation during SSL training, as performance suffers when iterating over…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
