An Empirical Analysis of Speech Self-Supervised Learning at Multiple Resolutions
Theo Clark, Benedetta Cevoli, Eloy de Jong, Timofey Abramski, Jamie Dougherty

TL;DR
This paper investigates multi-resolution speech SSL models, revealing that performance gains are mainly due to auxiliary losses rather than multi-scale representations, and that downsampling improves efficiency but not representation quality.
Contribution
The study provides an empirical analysis of multi-scale speech SSL architectures, challenging assumptions about their effectiveness in capturing hierarchical speech representations.
Findings
Performance improvements stem mainly from the auxiliary low-resolution loss, not from multi-scale representations.
Downsampling does not enhance downstream task performance.
Downsampling increases computational efficiency without improving representation quality.
Abstract
Self-supervised learning (SSL) models have become crucial in speech processing, with recent advancements concentrating on developing architectures that capture representations across multiple timescales. The primary goal of these multi-scale architectures is to exploit the hierarchical nature of speech, where lower-resolution components aim to capture representations that align with increasingly abstract concepts (e.g., from phones to words to sentences). Although multi-scale approaches have demonstrated some improvements over single-scale models, the precise reasons for these enhancements have poor empirical support. In this study, we present an initial analysis of layer-wise representations in multi-scale architectures, with a focus on Canonical Correlation Analysis (CCA) and Mutual Information (MI). We apply this analysis to Multi-Resolution HuBERT (MR-HuBERT) and find that (1) the…
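As a rough illustration of the kind of layer-wise comparison the abstract describes, the sketch below computes canonical correlations between two sets of frame-level features. It assumes plain linear CCA on NumPy arrays of shape (frames, features) with matching frame counts; the function names are hypothetical, and the exact CCA variant, preprocessing, and frame alignment used in the paper are not stated in this summary.

```python
import numpy as np


def _orthonormal_basis(x, tol=1e-10):
    """Orthonormal basis for the column space of x, dropping near-zero directions."""
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, s > tol * s.max()]


def canonical_correlations(x, y):
    """Canonical correlations between x and y, each of shape (frames, features).

    Illustrative assumption: plain linear CCA, not necessarily the variant
    used in the paper. Both inputs must have the same number of frames.
    """
    # Center every feature dimension so correlations ignore constant offsets.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    qx = _orthonormal_basis(x)
    qy = _orthonormal_basis(y)
    # The singular values of Qx^T Qy are the canonical correlations.
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    # Clip to [0, 1] to absorb floating-point rounding.
    return np.clip(rho, 0.0, 1.0)
```

Averaging the returned values gives a single similarity score per pair of layers. Representations at different resolutions have different frame counts, so they would need to be aligned (for example by upsampling or pooling) before being passed to a function like this.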
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis
Methods: ALIGN · Focus
