Ask2Mask: Guided Data Selection for Masked Speech Modeling

Murali Karthick Baskar; Andrew Rosenberg; Bhuvana Ramabhadran; Yu Zhang; Pedro Moreno

arXiv:2202.12719 · cs.SD · February 28, 2022 · 1 citation

Open Access

TL;DR

This paper introduces ask2mask (ATM), a novel data selection method for masked speech modeling that uses an external scorer to focus training on more relevant speech samples, improving ASR performance especially in mismatched conditions.

Contribution

ATM is the first approach to incorporate sample-level confidence scores for targeted data selection in MSM pre-training, enhancing speech representation learning.

Findings

01

Significant improvement in recognition accuracy under mismatched conditions.

02

Up to 11.6% relative error reduction on benchmark datasets.

03

Modest gains observed in matched conditions.

Abstract

Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn representations over speech frames which are randomly masked within an utterance. While these methods improve performance of Automatic Speech Recognition (ASR) systems, they have one major limitation. They treat all unsupervised speech samples with equal weight, which hinders learning as not all samples have relevant information to learn meaningful representations. In this work, we address this limitation. We propose ask2mask (ATM), a novel approach to focus on specific samples during MSM pre-training. ATM employs an external ASR model, or scorer, to weight unsupervised input samples in two different ways: 1) A fine-grained data selection is performed by masking over the highly confident input frames as chosen by the scorer. This allows the model to learn meaningful representations. 2) ATM is further extended…
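
The frame-level selection in point 1 of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the scorer has already produced one confidence value per frame, and the function name `confidence_guided_mask` and its parameters `start_fraction` and `span` are hypothetical names chosen for this example. Instead of drawing mask positions at random, it starts masked spans at the most confident frames.

```python
def confidence_guided_mask(confidences, start_fraction=0.5, span=1):
    """Return a list of booleans marking which frames to mask.

    confidences    -- per-frame confidence scores from an external scorer
                      (assumed to be precomputed; higher means more confident)
    start_fraction -- hypothetical parameter: fraction of frames used as span starts
    span           -- hypothetical parameter: consecutive frames masked per start
    """
    num_frames = len(confidences)
    num_starts = int(round(start_fraction * num_frames))

    # Rank frame indices from most to least confident.
    # sorted() is stable, so tied frames keep their original order.
    ranked = sorted(range(num_frames), key=lambda i: -confidences[i])

    mask = [False] * num_frames
    for start in ranked[:num_starts]:
        # Mask a contiguous span, clipped at the end of the utterance.
        for i in range(start, min(start + span, num_frames)):
            mask[i] = True
    return mask


# Illustrative scores only, not data from the paper.
scores = [0.9, 0.2, 0.8, 0.1, 0.95, 0.3]
print(confidence_guided_mask(scores, start_fraction=0.5))
# prints [True, False, True, False, True, False]
```

In this toy utterance the three most confident frames (indices 4, 0 and 2) are masked, whereas a random masking scheme would pick positions without regard to the scores.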

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling