LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks

Amit Meghanani; Thomas Hain

arXiv:2406.09153·cs.CL·June 14, 2024

TL;DR

This paper introduces LASER, a cost-effective self-supervised fine-tuning method that aligns speech representations to improve content-related tasks such as ASR and phoneme recognition. It reports relative improvements of up to 11.7% with less than 3 hours of fine-tuning on a single GPU.

Contribution

LASER is an alignment-based fine-tuning approach built on a soft-DTW loss with a temporal regularisation term. It improves SSL speech models for content-related tasks at low computational cost.

Findings

01. Achieves up to 11.7% relative improvement on phoneme recognition (WavLM).

02. Requires less than 3 hours of fine-tuning on a single GPU.

03. Improves both models tested, HuBERT and WavLM.
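
The relative improvement figures above can be read with the usual convention for error-rate metrics. This is an assumption, since this page does not define the term: it presumes that ASR is scored by word error rate and phoneme recognition by phoneme error rate, and the symbols e_base (error rate of the original model) and e_LASER (error rate after LASER fine-tuning) are illustrative names, not the paper's notation.

```latex
\[
\text{relative improvement (\%)} \;=\; \frac{e_{\text{base}} - e_{\text{LASER}}}{e_{\text{base}}} \times 100
\]
```

Under this convention, a positive value means the fine-tuned model makes proportionally fewer errors than the baseline.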

Abstract

Self-supervised learning (SSL)-based speech models are extensively used for full-stack speech processing. However, it has been observed that improving SSL-based speech representations using unlabeled speech for content-related tasks is challenging and computationally expensive. Recent attempts have been made to address this issue with cost-effective self-supervised fine-tuning (SSFT) approaches. Continuing in this direction, a cost-effective SSFT method named "LASER: Learning by Aligning Self-supervised Representations" is presented. LASER is based on the soft-DTW alignment loss with a temporal regularisation term. Experiments are conducted with the HuBERT and WavLM models, evaluated on the SUPERB benchmark for two content-related tasks: automatic speech recognition (ASR) and phoneme recognition (PR). A relative improvement of 3.7% and 8.2% for HuBERT, and 4.1% and 11.7% for WavLM are…
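
The abstract names the soft-DTW alignment loss without giving its form on this page. As a minimal sketch, assuming the standard soft-DTW recursion with squared Euclidean frame distances, the cost between two feature sequences might be computed as follows. The function names `soft_min` and `soft_dtw`, the default `gamma`, and the NumPy formulation are illustrative choices, not the authors' implementation, and the temporal regularisation term is omitted because its form is not stated here.

```python
import numpy as np


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))


def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW cost between sequences x of shape (n, d) and y of shape (m, d).

    Assumption: squared Euclidean distance between frames; gamma > 0 controls
    how closely the smoothed minimum tracks the hard minimum.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # Pairwise squared Euclidean distances between all frame pairs.
    dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    # r[i, j] holds the soft alignment cost of x[:i] against y[:j].
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = dist[i - 1, j - 1] + soft_min(
                [r[i - 1, j - 1], r[i - 1, j], r[i, j - 1]], gamma
            )
    return r[n, m]
```

With a small `gamma`, the value approaches the classical DTW cost, so a time-warped copy of a sequence scores close to zero while an unrelated sequence scores much higher. Because the smoothed minimum is differentiable, a cost of this kind can serve as a training loss, which is what makes it usable for fine-tuning.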

Code & Models

Repositories

Trikaldarshi/LASER (PyTorch, official)

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning