What do Speech Foundation Models Learn? Analysis and Applications
Ankita Pasad

TL;DR
This thesis analyzes what speech foundation models learn, evaluates them on spoken language understanding tasks, and introduces new datasets and end-to-end approaches that outperform cascaded speech recognition and text pipelines.
Contribution
It presents a lightweight framework for analyzing the layers of speech foundation models (SFMs), introduces new spoken language understanding (SLU) tasks and datasets, and demonstrates that SFM-based end-to-end models can outperform cascaded approaches.
Findings
SFM layers encode meaningful acoustic and linguistic knowledge.
E2E SFM-based models outperform cascaded speech recognition and text models.
New datasets for spoken named entity recognition and localization enhance SLU research.
Abstract
Speech foundation models (SFMs) are designed to serve as general-purpose representations for a wide range of speech-processing tasks. The last five years have seen an influx of increasingly successful self-supervised and supervised pre-trained models with impressive performance on various downstream tasks. Although the zoo of SFMs continues to grow, our understanding of the knowledge they acquire lags behind. This thesis presents a lightweight analysis framework using statistical tools and training-free tasks to investigate the acoustic and linguistic knowledge encoded in SFM layers. We conduct a comparative study across multiple SFMs and statistical tools. Our study also shows that the analytical insights have concrete implications for downstream task performance. The effectiveness of an SFM is ultimately determined by its performance on speech applications. Yet it remains unclear…
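As a rough illustration of what a training-free, layer-wise analysis can look like, the sketch below scores each layer's representations against a reference property using linear centered kernel alignment (CKA). The abstract does not name the statistical tools used in the thesis, so linear CKA is an assumption chosen here for illustration, and the function names, the synthetic "layers", and the reference matrix are all hypothetical stand-ins rather than the author's method or data.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two matrices of shape (n_samples, dim).

    Returns a value in [0, 1]; 1 means the two views are identical
    up to rotation and isotropic scaling.
    """
    # Center each feature dimension so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))

def layerwise_scores(layer_feats, target):
    """Score every layer against the same reference, with no training involved."""
    return [linear_cka(feats, target) for feats in layer_feats]

# Synthetic stand-ins (hypothetical): a reference property and three "layers"
# that carry the same signal under decreasing amounts of noise.
rng = np.random.default_rng(0)
target = rng.normal(size=(200, 8))
proj = rng.normal(size=(8, 16))
layers = [
    target @ proj + noise * rng.normal(size=(200, 16))
    for noise in (8.0, 2.0, 0.5)
]

scores = layerwise_scores(layers, target)
for idx, score in enumerate(scores):
    print(f"layer {idx}: CKA = {score:.3f}")
```

With real models, `layers` would be replaced by per-layer features extracted from an SFM and `target` by whatever acoustic or linguistic reference is of interest; the per-layer curve of scores is then the object of analysis.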
Taxonomy
Topics: Speech Recognition and Synthesis
