Speech Representation Learning Revisited: The Necessity of Separate Learnable Parameters and Robust Data Augmentation

Hemant Yadav; Sunayana Sitaram; Rajiv Ratn Shah

arXiv:2408.10557 · cs.CL · March 4, 2025

TL;DR

This paper revisits speech representation learning, emphasizing the importance of separate learnable parameters and robust data augmentation to improve encoding of different speech information types, leading to state-of-the-art results.

Contribution

The paper introduces O-HuBERT, a modified HuBERT model that uses separate learnable parameters for different types of speech information, and demonstrates improved encoding and performance.

Findings

1. O-HuBERT effectively encodes 'other' speech information across all layers.
2. Robust data augmentation is crucial for tasks relying on 'other' information.
3. The approach achieves state-of-the-art (SOTA) performance on the SUPERB benchmark with a 100M-parameter model.

Abstract

Speech modeling methods learn one embedding for a fixed segment of speech, typically between 10 and 25 ms. The information present in speech can be divided into two categories: "what is being said" (content) and "how it is expressed" (other). These two are orthogonal in nature, so an optimization algorithm that is forced to optimize both together finds a sub-optimal solution. This leads to sub-optimal performance in one or all downstream tasks, as shown by previous studies. Current self-supervised learning (SSL) methods such as HuBERT are very good at modeling the content information present in speech. Data augmentation improves performance on tasks that require effective modeling of other information, but it divides the capacity of the model. In this work, we conduct a preliminary study to understand the importance of modeling other information using separate learnable…
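To make the "one embedding per fixed segment" statement concrete, here is a minimal arithmetic sketch. It is not taken from the paper: the function name `num_embeddings` is hypothetical, and it assumes non-overlapping segments and ignores the edge effects of a real encoder's convolutional front end. The 20 ms value is only an illustrative point inside the 10 to 25 ms range quoted in the abstract.

```python
def num_embeddings(duration_ms: int, segment_ms: int) -> int:
    """Count the whole fixed-length segments (one embedding each) in an utterance.

    Assumption: segments do not overlap and a trailing partial segment is dropped.
    """
    if segment_ms <= 0:
        raise ValueError("segment_ms must be positive")
    return duration_ms // segment_ms


# One second of speech at both ends of the quoted range, plus an
# illustrative 20 ms segment length in between.
for segment_ms in (10, 20, 25):
    count = num_embeddings(1000, segment_ms)
    print(f"{segment_ms} ms per embedding: {count} embeddings per second")
```

Under these assumptions, shorter segments give a longer embedding sequence for the same audio, which is the sequence that any content or 'other' modeling has to work from.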

Taxonomy

Topics: Speech Recognition and Synthesis