On Compressing Sequences for Self-Supervised Speech Models
Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-yi Lee, Hao Tang

TL;DR
This paper investigates sequence compression techniques in self-supervised speech models, demonstrating that subsampling can reduce computational costs and maintain or improve downstream task performance, especially with phonetic boundary information.
Contribution
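The paper studies fixed-length and variable-length subsampling along the time axis during self-supervised training and analyzes their effect on downstream performance and inference speed, showing that variable-length subsampling performs particularly well at low frame rates.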
Findings
Subsampling improves downstream task performance at certain frame rates.
Variable-length subsampling excels at low frame rates.
With access to phonetic boundaries, performance does not degrade at average frame rates as low as 10 Hz.
Abstract
Compressing self-supervised models has become increasingly necessary, as self-supervised models become larger. While previous approaches have primarily focused on compressing the model size, shortening sequences is also effective in reducing the computational cost. In this work, we study fixed-length and variable-length subsampling along the time axis in self-supervised learning. We explore how individual downstream tasks are sensitive to input frame rates. Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference. Variable-length subsampling performs particularly well under low frame rates. In addition, if we have access to phonetic boundaries, we find no degradation in performance for an average frame rate as low as 10 Hz.
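The difference between the two subsampling schemes named in the abstract can be illustrated with a minimal sketch. It assumes that frames are merged by simple averaging, which this summary does not confirm; the function names `fixed_length_subsample` and `variable_length_subsample` and the toy inputs are hypothetical and are not taken from the paper.

```python
import numpy as np


def fixed_length_subsample(frames: np.ndarray, factor: int) -> np.ndarray:
    """Average every `factor` consecutive frames (assumed pooling choice).

    `frames` has shape (time, feature). The last group may be shorter
    when the number of frames is not a multiple of `factor`.
    """
    groups = [frames[i:i + factor] for i in range(0, len(frames), factor)]
    return np.stack([g.mean(axis=0) for g in groups])


def variable_length_subsample(frames: np.ndarray, boundaries: list[int]) -> np.ndarray:
    """Average the frames inside each segment (assumed pooling choice).

    `boundaries` holds the sorted frame indices at which a new segment
    starts, excluding index 0. With phonetic boundaries, each output
    frame would correspond to one phone-like segment.
    """
    edges = [0, *boundaries, len(frames)]
    return np.stack([frames[s:e].mean(axis=0) for s, e in zip(edges[:-1], edges[1:])])


# Toy input: 6 frames with 2 features each.
frames = np.arange(12, dtype=float).reshape(6, 2)

# Fixed-length: halve the frame rate (for example, 50 Hz to 25 Hz).
fixed = fixed_length_subsample(frames, factor=2)

# Variable-length: segments cover frames [0, 1), [1, 4), and [4, 6).
variable = variable_length_subsample(frames, boundaries=[1, 4])
```

Both calls reduce six frames to three, but the fixed-length version merges frames on a uniform grid while the variable-length version follows the supplied segment boundaries, so the output frame rate is only defined on average.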
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
