EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning

Ashish Seth; Ramaneswaran Selvakumar; S Sakshi; Sonal Kumar; Sreyan Ghosh; Dinesh Manocha

arXiv:2410.13179·cs.SD·October 18, 2024

TL;DR

EH-MAM introduces an adaptive masking strategy for self-supervised speech learning, progressively challenging the model with harder regions based on frame-wise loss predictions, leading to improved speech representations.

Contribution

The paper proposes a novel selective masking approach that automatically identifies and masks harder speech regions during training, enhancing self-supervised speech representation learning.

Findings

1. Outperforms state-of-the-art baselines by 5-10% on low-resource speech tasks.

2. Effectively captures useful contextual information across speech frames.

3. Demonstrates the benefit of adaptive masking over random masking.

Abstract

In this paper, we present EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a novel self-supervised learning approach for speech representation learning. In contrast to the prior methods that use random masking schemes for Masked Acoustic Modeling (MAM), we introduce a novel selective and adaptive masking strategy. Specifically, during SSL training, we progressively introduce harder regions to the model for reconstruction. Our approach automatically selects hard regions and is built on the observation that the reconstruction loss of individual frames in MAM can provide natural signals to judge the difficulty of solving the MAM pre-text task for that frame. To identify these hard regions, we employ a teacher model that first predicts the frame-wise losses and then decides which frames to mask. By learning to create challenging problems, such as identifying harder frames and…
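The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the linear schedule, and the rule of filling the rest of the mask budget with random frames are all assumptions made for illustration. The only parts taken from the abstract are that frames are ranked by predicted frame-wise loss and that harder frames are introduced progressively.

```python
import random


def hard_fraction(step, total_steps):
    """Hypothetical linear schedule (an assumption, not from the paper):
    the share of the mask budget given to hard frames grows from 0 to 1."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(1.0, max(0.0, step / total_steps))


def select_mask(predicted_losses, mask_ratio, hard_frac, rng):
    """Pick frame indices to mask.

    predicted_losses: per-frame losses, as a teacher model might predict them.
    mask_ratio: fraction of all frames to mask.
    hard_frac: fraction of the mask budget reserved for the hardest frames.
    rng: a random.Random instance used for the remaining (random) picks.
    """
    n = len(predicted_losses)
    budget = int(n * mask_ratio)      # total number of frames to mask
    n_hard = int(budget * hard_frac)  # how many of them are chosen by difficulty

    # Rank frame indices by predicted loss, hardest first.
    ranked = sorted(range(n), key=lambda i: predicted_losses[i], reverse=True)
    hard = ranked[:n_hard]

    # Fill the rest of the budget uniformly at random from the other frames.
    remaining = ranked[n_hard:]
    random_part = rng.sample(remaining, budget - n_hard)

    return sorted(hard + random_part)
```

Called with `hard_frac=0.0`, this sketch reduces to plain random masking; with `hard_frac=1.0`, it masks only the frames with the highest predicted loss. Passing `hard_fraction(step, total_steps)` as `hard_frac` moves from one extreme to the other over training.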

Code & Models

Repositories

cs20s030/ehmam (PyTorch, official)

Videos

EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning · Underline

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: L1 Regularization · Adaptive Masking