Speech Representation Learning Revisited: The Necessity of Separate Learnable Parameters and Robust Data Augmentation
Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah

TL;DR
This paper revisits speech representation learning, emphasizing the importance of separate learnable parameters and robust data augmentation to improve encoding of different speech information types, leading to state-of-the-art results.
Contribution
It introduces O-HuBERT, a modified HuBERT model with separate parameters for different speech information types, demonstrating improved encoding and performance.
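As a rough illustration of what "separate parameters for different speech information types" can mean, the toy sketch below gives each information type its own weight matrix over the same frame-level feature vector. This is not the O-HuBERT architecture, which this summary does not describe; the class `TwoBranchProjector`, the helper functions, and all dimensions are hypothetical names and values chosen only for illustration.

```python
import random


def make_weights(in_dim, out_dim, rng):
    """Return an out_dim x in_dim matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, frame):
    """Multiply a weight matrix by one frame-level feature vector."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


class TwoBranchProjector:
    """Hypothetical toy: one parameter set per information type."""

    def __init__(self, in_dim, content_dim, other_dim, seed=0):
        rng = random.Random(seed)
        # Separate parameters: the two matrices share no entries, so an
        # update to one branch cannot change the other branch's output.
        self.content_weights = make_weights(in_dim, content_dim, rng)
        self.other_weights = make_weights(in_dim, other_dim, rng)

    def __call__(self, frame):
        content = project(self.content_weights, frame)  # "what is being said"
        other = project(self.other_weights, frame)      # "how it is expressed"
        return content, other


# Assumed toy sizes: a 4-dimensional input frame, 3 content dims, 2 other dims.
projector = TwoBranchProjector(in_dim=4, content_dim=3, other_dim=2)
content_vec, other_vec = projector([0.5, -1.0, 0.25, 2.0])
print(len(content_vec), len(other_vec))
```

The point of the sketch is only the parameter layout: because the two matrices are independent, a training signal for one information type does not have to compete for the same weights as the other.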
Findings
O-HuBERT effectively encodes 'other' speech information across all layers.
Robust data augmentation is crucial for tasks relying on 'other' information.
O-HuBERT achieves SOTA performance on the SUPERB benchmark with a 100M-parameter model.
Abstract
Speech modeling methods learn one embedding for a fixed segment of speech, typically between 10 and 25 ms. The information present in speech can be divided into two categories: "what is being said" (content) and "how it is expressed" (other). The two are orthogonal in nature, so an optimization algorithm forced to optimize both together tends to find a sub-optimal solution. As previous studies have shown, this leads to sub-optimal performance on one or all downstream tasks. Current self-supervised learning (SSL) methods such as HuBERT are very good at modeling the content information present in speech. Data augmentation improves performance on tasks that require effective modeling of other information, but it divides the model's capacity. In this work, we conduct a preliminary study to understand the importance of modeling other information using separate learnable…
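To make the "one embedding per fixed segment" statement concrete, the short sketch below counts how many embeddings a one-second utterance would yield at a few segment lengths in the 10-25 ms range. It assumes non-overlapping segments and one embedding per whole segment; the function name `embeddings_per_utterance` and the one-second duration are illustrative choices, not taken from the paper.

```python
def embeddings_per_utterance(duration_ms: int, segment_ms: int) -> int:
    """Count whole, non-overlapping segments of segment_ms in duration_ms."""
    if segment_ms <= 0:
        raise ValueError("segment_ms must be positive")
    return duration_ms // segment_ms


# Assumed example: a 1000 ms utterance at three segment lengths.
for segment_ms in (10, 20, 25):
    print(segment_ms, embeddings_per_utterance(1000, segment_ms))
```

Under these assumptions, a one-second utterance gives 100, 50, and 40 embeddings at 10, 20, and 25 ms respectively, and each of those embeddings is asked to carry both content and other information.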
Taxonomy
Topics: Speech Recognition and Synthesis
