MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization

Adriana Fernandez-Lopez; Honglie Chen; Pingchuan Ma; Lu Yin; Qiao Xiao; Stavros Petridis; Shiwei Liu; Maja Pantic

arXiv:2406.17614·cs.CV·June 26, 2024

TL;DR

This paper introduces MSRS, a sparse regularization technique enabling training of multimodal speech recognition models from scratch, reducing training time and achieving competitive accuracy without pre-training.

Contribution

MSRS is a novel regularization method that learns sparse structures early in training, allowing models to be trained from scratch with improved efficiency and competitive performance.

Findings

01. Achieves 21.1% WER on the LRS3 benchmark for VSR

02. Reduces training time by at least 2x

03. Enables training from scratch by implicitly masking weights affected by vanishing gradients
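
For readers unfamiliar with the metric, WER (word error rate) is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate` is hypothetical, and any text normalization the authors apply before scoring is not described on this page and is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: splits on whitespace and applies no casing or
    punctuation normalization (an assumption, not the paper's scoring setup).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

Under this definition, a 21.1% WER corresponds to roughly one word error for every five reference words.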

Abstract

Pre-trained models have been a foundational approach in speech recognition, albeit with associated additional costs. In this study, we propose a regularization technique that facilitates the training of visual and audio-visual speech recognition models (VSR and AVSR) from scratch. This approach, abbreviated as MSRS (Multimodal Speech Recognition from Scratch), introduces a sparse regularization that rapidly learns sparse structures within the dense model at the very beginning of training, which receives healthier gradient flow than the dense equivalent. Once the sparse mask stabilizes, our method allows transitioning to a dense model or keeping a sparse model by updating non-zero values. MSRS achieves competitive results in VSR and AVSR with 21.1% and 0.9% WER on the LRS3 benchmark, while reducing training time by at least 2x. We explore other sparse approaches and show that…
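
The mask-then-continue workflow described in the abstract can be pictured with a toy example. This is a loose sketch under stated assumptions, not the authors' method: the abstract does not specify the regularizer, so simple magnitude-based masking stands in for it, and the names `magnitude_mask`, `mask_change`, and `sgd_step`, along with the toy weights and gradients, are all hypothetical.

```python
def magnitude_mask(weights, sparsity):
    """Keep the largest-magnitude weights (1 = keep, 0 = prune).

    Assumption: magnitude masking is a stand-in for the paper's
    sparse regularization, which this page does not specify.
    """
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def mask_change(old_mask, new_mask):
    """Fraction of positions whose keep/prune decision flipped."""
    return sum(a != b for a, b in zip(old_mask, new_mask)) / len(old_mask)


def sgd_step(weights, grads, lr, mask=None):
    """Plain SGD step.

    With a mask, pruned weights are held at zero and only kept weights
    are updated (the "keep a sparse model" option). With mask=None every
    weight is updated (the "transition to a dense model" option).
    """
    if mask is None:
        return [w - lr * g for w, g in zip(weights, grads)]
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]


# Toy values, chosen only for illustration.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
grads = [0.1, 0.1, -0.1, 0.1, 0.1, -0.1]

mask = magnitude_mask(weights, sparsity=0.5)
sparse_weights = sgd_step(weights, grads, lr=0.1, mask=mask)

# The mask is treated as stable here if recomputing it changes nothing.
new_mask = magnitude_mask(sparse_weights, sparsity=0.5)
stable = mask_change(mask, new_mask) == 0.0

# Once stable, training could continue sparse (pass the mask) or dense (omit it).
dense_weights = sgd_step(weights, grads, lr=0.1)
```

The stability check compares consecutive masks; what threshold or schedule the paper uses to decide that the mask has stabilized is not stated in the abstract.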

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing