Efficient Training of Self-Supervised Speech Foundation Models on a Compute Budget

Andy T. Liu; Yi-Cheng Lin; Haibin Wu; Stefan Winkler; Hung-yi Lee

arXiv:2409.16295 · eess.AS · February 6, 2025

TL;DR

This paper explores how to efficiently train speech foundation models with self-supervised learning within limited compute resources, emphasizing the importance of model architecture, data size, and their trade-offs.

Contribution

The paper takes analytical steps toward understanding the training dynamics of speech foundation models and benchmarks SSL objectives in a comparable setting, showing that model architecture and data size significantly influence performance under compute constraints.

Findings

01

Slimmer architectures outperform common small architectures under the same compute and parameter budget.

02

Pre-training data size remains critical despite data augmentation.

03

An optimal model size exists for a given compute budget, balancing model and data size.
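
The trade-off behind the third finding can be sketched with simple arithmetic: under a fixed compute budget, a larger model can afford fewer training frames. The sketch below is illustrative only. The cost rule (training compute roughly equal to 6 × parameters × frames) is an assumption borrowed from text language-model scaling work, not a figure from this paper, and the budget, the model sizes, and the function names `affordable_frames` and `budget_table` are hypothetical.

```python
def affordable_frames(budget_flops: float, n_params: float,
                      flops_per_param_per_frame: float = 6.0) -> float:
    """Training frames affordable for a model of n_params under budget_flops.

    Assumes compute ~= flops_per_param_per_frame * n_params * frames.
    The default factor of 6 is an assumption, not a value from the paper.
    """
    return budget_flops / (flops_per_param_per_frame * n_params)


def budget_table(budget_flops: float, model_sizes: list[float]) -> list[tuple[float, float]]:
    """Pair each candidate model size with the frames it can afford."""
    return [(n, affordable_frames(budget_flops, n)) for n in model_sizes]


if __name__ == "__main__":
    # Hypothetical budget and model sizes, chosen only to show the trend.
    for n_params, frames in budget_table(1e18, [5e6, 2e7, 9.5e7]):
        print(f"{n_params:.1e} params -> {frames:.2e} frames")
```

The table only shows the constraint. Where the optimum falls depends on how performance responds to model size and data size, which this sketch does not model.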

Abstract

Despite their impressive success, training foundation models remains computationally costly. This paper investigates how to efficiently train speech foundation models with self-supervised learning (SSL) under a limited compute budget. We examine critical factors in SSL that impact the budget, including model architecture, model size, and data size. Our goal is to make analytical steps toward understanding the training dynamics of speech foundation models. We benchmark SSL objectives in an entirely comparable setting and find that other factors contribute more significantly to the success of SSL. Our results show that slimmer model architectures outperform common small architectures under the same compute and parameter budget. We demonstrate that the size of the pre-training data remains crucial, even with data augmentation during SSL training, as performance suffers when iterating over…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques