EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning
Ashish Seth, Ramaneswaran Selvakumar, S Sakshi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha

TL;DR
EH-MAM introduces an adaptive masking strategy for self-supervised speech learning, progressively challenging the model with harder regions based on frame-wise loss predictions, leading to improved speech representations.
Contribution
The paper proposes a novel selective masking approach that automatically identifies and masks harder speech regions during training, enhancing self-supervised speech representation learning.
Findings
Outperforms state-of-the-art baselines by 5-10% on low-resource speech tasks.
Effectively captures useful contextual information across speech frames.
Demonstrates the benefit of adaptive masking over random masking.
Abstract
In this paper, we present EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a novel self-supervised learning approach for speech representation learning. In contrast to prior methods that use random masking schemes for Masked Acoustic Modeling (MAM), we introduce a selective and adaptive masking strategy. Specifically, during SSL training, we progressively introduce harder regions to the model for reconstruction. Our approach automatically selects hard regions and is built on the observation that the reconstruction loss of individual frames in MAM provides a natural signal for judging how difficult the MAM pretext task is for that frame. To identify these hard regions, we employ a teacher model that first predicts the frame-wise losses and then decides which frames to mask. By learning to create challenging problems, such as identifying harder frames and…
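The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `hard_fraction` and `select_mask`, the linear ramp schedule, and the split of the mask budget between top-loss frames and randomly chosen frames are all hypothetical. The predicted frame-wise losses are passed in as a plain list that stands in for the teacher model's output.

```python
import random


def hard_fraction(step, total_steps, start=0.0, end=1.0):
    """Linearly ramp the share of the mask budget given to hard frames.

    Assumption: a linear easy-to-hard schedule; the source does not
    specify the schedule.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def select_mask(predicted_losses, mask_ratio, hard_frac, rng):
    """Pick frame indices to mask from predicted per-frame losses.

    A share `hard_frac` of the mask budget goes to the frames with the
    highest predicted loss; the rest of the budget is filled at random
    from the remaining frames. (Hypothetical rule for illustration.)
    """
    n = len(predicted_losses)
    budget = round(n * mask_ratio)      # total number of frames to mask
    n_hard = round(budget * hard_frac)  # frames chosen by predicted loss

    # Rank frame indices from highest to lowest predicted loss.
    ranked = sorted(range(n), key=lambda i: predicted_losses[i], reverse=True)
    hard = ranked[:n_hard]

    # Fill the remaining budget uniformly at random from the other frames.
    easy = rng.sample(ranked[n_hard:], budget - n_hard)
    return sorted(hard + easy)


if __name__ == "__main__":
    rng = random.Random(0)
    losses = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.5, 0.05]
    for step in (0, 500, 1000):
        frac = hard_fraction(step, total_steps=1000)
        print(step, frac, select_mask(losses, 0.4, frac, rng))
```

With `hard_frac` at 0 the selection reduces to random masking, and at 1 it masks only the frames the teacher expects to be hardest, so a single parameter moves training from the easy regime to the hard one.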
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: L1 Regularization · Adaptive Masking
