Linear-Complexity Self-Supervised Learning for Speech Processing

Shucong Zhang; Titouan Parcollet; Rogier van Dalen; Sourav Bhattacharya

arXiv:2407.13377·cs.CL·July 19, 2024

TL;DR

This paper applies SummaryMixing, a linear-complexity context encoder, to self-supervised speech pre-training with wav2vec 2.0, reducing pre-training time and peak VRAM while matching or improving downstream performance.

Contribution

It is the first to explore linear-complexity encoders for SSL in speech, demonstrating efficiency gains over traditional MHSA-based models.

Findings

01. Reduces wav2vec 2.0 pre-training time by 18%.
02. Decreases peak VRAM during pre-training by 23%.
03. Achieves comparable or better performance on the downstream tasks of the MP3S benchmark.

Abstract

Self-supervised learning (SSL) models usually require weeks of pre-training with dozens of high-end GPUs. These models typically have a multi-headed self-attention (MHSA) context encoder. However, MHSA takes quadratic time and space in the input length, contributing to the high pre-training cost. Linear-complexity alternatives to MHSA have been proposed. For instance, in supervised training, the SummaryMixing model is the first to outperform MHSA across multiple speech processing tasks. However, these cheaper alternatives have not yet been explored for SSL. This paper studies a linear-complexity context encoder for SSL for the first time. With better or equivalent performance on the downstream tasks of the MP3S benchmark, SummaryMixing reduces the pre-training time and peak VRAM of the wav2vec 2.0 model by 18% and 23%, respectively, leading to the pre-training of a 155M wav2vec 2.0…
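The contrast between quadratic and linear cost in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the general SummaryMixing idea of combining a per-frame transform with a single utterance-level mean vector, and the function names, weight shapes, and single-layer ReLU transforms are all hypothetical choices made for illustration.

```python
import numpy as np


def relu(a):
    return np.maximum(a, 0.0)


def summary_mixing_sketch(x, w_local, w_summary, w_combine):
    """Hypothetical single-layer sketch of a summary-based context encoder.

    x: (T, d) sequence of T frame features.
    Every step below touches each frame once, so cost grows linearly in T.
    """
    local = relu(x @ w_local)                       # (T, h) per-frame transform
    summary = relu(x @ w_summary).mean(axis=0)      # (h,) one vector per utterance
    tiled = np.broadcast_to(summary, local.shape)   # (T, h) same summary at every frame
    mixed = np.concatenate([local, tiled], axis=1)  # (T, 2h)
    return relu(mixed @ w_combine)                  # (T, d_out)


def attention_scores(x):
    """Score matrix of one self-attention head (projections omitted).

    The (T, T) result is the term that makes MHSA quadratic in T.
    """
    return (x @ x.T) / np.sqrt(x.shape[1])


# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
T, d, h = 200, 16, 32
x = rng.standard_normal((T, d))
w_local = rng.standard_normal((d, h))
w_summary = rng.standard_normal((d, h))
w_combine = rng.standard_normal((2 * h, d))

out = summary_mixing_sketch(x, w_local, w_summary, w_combine)
scores = attention_scores(x)
```

In this sketch the largest intermediate in `summary_mixing_sketch` has T × 2h entries, whereas `attention_scores` materializes T × T entries, so doubling the utterance length doubles the former and quadruples the latter.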

Code & Models

Repositories

samsunglabs/summarymixing (PyTorch, official)

Taxonomy

Topics: Neural Networks and Applications · Speech Recognition and Synthesis