LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks
Amit Meghanani, Thomas Hain

TL;DR
This paper introduces LASER, a cost-effective self-supervised fine-tuning method that aligns speech representations to improve content-related tasks such as ASR and phoneme recognition, with relative improvements of up to 11.7% from less than 3 hours of fine-tuning on a single GPU.
Contribution
LASER is an alignment-based fine-tuning approach that uses a soft-DTW loss with a temporal regularisation term to improve SSL speech models for content-related tasks at low computational cost.
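As a rough illustration of the soft-DTW alignment cost named above, here is a minimal NumPy sketch. It is not the authors' implementation: the squared Euclidean frame cost, the default `gamma`, and the function names `soft_min` and `soft_dtw` are assumptions made for this example, and the temporal regularisation term is left out because this summary does not specify its form.

```python
import numpy as np


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))


def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW value between two feature sequences.

    x: array of shape (n, d), y: array of shape (m, d).
    gamma > 0 controls smoothing; small gamma approaches classic DTW.
    """
    n, m = len(x), len(y)
    # Pairwise squared Euclidean distance between frames (assumed cost).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    # r[i, j] holds the soft alignment cost of x[:i] against y[:j].
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma
            )
    return r[n, m]
```

For example, calling `soft_dtw` on a sequence and a time-stretched copy of it with a small `gamma` gives a value close to zero, while unrelated sequences give a much larger value. In a fine-tuning setup the same recursion would be written in a framework with automatic differentiation so that the value can serve as a loss.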
Findings
Achieves up to 11.7% relative improvement on phoneme recognition (with WavLM).
Requires less than 3 hours of fine-tuning on a single GPU.
Effective for both models tested, HuBERT and WavLM.
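The "relative improvement" figures above are relative changes in an error rate. A small sketch of the arithmetic follows, assuming a lower-is-better metric such as word or phoneme error rate; the function name `relative_improvement` and the numbers in the comment are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_error, new_error):
    """Relative reduction in an error rate, in percent.

    Both arguments are error rates where lower is better.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical numbers for illustration only (not from the paper):
# an error rate falling from 10.0 to 9.0 is a 10% relative improvement.
example = relative_improvement(10.0, 9.0)
```

Note that a relative improvement is larger than the absolute change in percentage points whenever the baseline error rate is below 100%.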
Abstract
Self-supervised learning (SSL)-based speech models are extensively used for full-stack speech processing. However, it has been observed that improving SSL-based speech representations using unlabeled speech for content-related tasks is challenging and computationally expensive. Recent attempts have been made to address this issue with cost-effective self-supervised fine-tuning (SSFT) approaches. Continuing in this direction, a cost-effective SSFT method named "LASER: Learning by Aligning Self-supervised Representations" is presented. LASER is based on the soft-DTW alignment loss with a temporal regularisation term. Experiments are conducted with HuBERT and WavLM models and evaluated on the SUPERB benchmark for two content-related tasks: automatic speech recognition (ASR) and phoneme recognition (PR). A relative improvement of 3.7% and 8.2% for HuBERT, and 4.1% and 11.7% for WavLM are…
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning
