Progressive Multi-Scale Self-Supervised Learning for Speech Recognition

Genshun Wan; Tan Liu; Hang Chen; Jia Pan; Cong Liu; Zhongfu Ye

arXiv:2212.03480 · eess.AS · December 8, 2022

Open Access

TL;DR

This paper introduces PMS-SSL, a progressive multi-scale self-supervised learning approach for speech recognition that enhances model performance by integrating multi-scale structures and fine-grained target sets, leading to significant WER reductions.

Contribution

The paper proposes a novel PMS-SSL method combining multi-scale self-attention and progressive target sets to improve speech recognition accuracy.

Findings

01. Achieves 13.7% relative WER reduction on Librispeech test sets.
02. Effective in low-resource training scenarios with 10 and 100 hours.
03. Outperforms baseline HuBERT model significantly.
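
For readers unfamiliar with the metric, a relative WER reduction is the drop in word error rate expressed as a percentage of the baseline's WER. The sketch below shows the arithmetic only; the function name and the example numbers are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (not taken from the paper):
# a baseline at 10.0% WER and a new model at 8.0% WER.
example = relative_wer_reduction(10.0, 8.0)
```

Note that a relative reduction is not the same as an absolute one: in the hypothetical example the absolute drop is 2.0 points of WER, while the relative reduction is 20%.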

Abstract

Self-supervised learning (SSL) models have achieved considerable improvements in automatic speech recognition (ASR). In theory, ASR performance could be improved further if the model were dedicated to learning audio content information. To this end, we propose a progressive multi-scale self-supervised learning (PMS-SSL) method, which uses fine-grained target sets to compute the SSL loss at the top layer and coarse-grained target sets at intermediate layers. Furthermore, PMS-SSL introduces a multi-scale structure into multi-head self-attention for better speech representation, restricting the attention area to a small scope at lower layers and to a large scope at higher layers. Experiments on the Librispeech dataset indicate the effectiveness of our proposed method. Compared with HuBERT, PMS-SSL achieves 13.7% / 12.7% relative WER reduction…
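
The multi-scale attention idea described in the abstract (a small attention scope at lower layers, a larger one at higher layers) can be illustrated with a banded attention mask whose width grows with layer depth. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the window sizes or how they are scheduled, so the linear schedule, the function names `layer_windows` and `local_attention_mask`, and all numeric values are hypothetical.

```python
def layer_windows(num_layers: int, min_window: int, max_window: int) -> list:
    """Assumed schedule: grow the window linearly from the bottom layer to the top."""
    if num_layers == 1:
        return [max_window]
    step = (max_window - min_window) / (num_layers - 1)
    return [round(min_window + k * step) for k in range(num_layers)]


def local_attention_mask(num_frames: int, window: int) -> list:
    """Boolean matrix: entry [i][j] is True if frame i may attend to frame j.

    A frame may attend only to frames at most `window` positions away.
    """
    return [
        [abs(i - j) <= window for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Hypothetical configuration: 4 layers, windows from 2 to 8 frames.
windows = layer_windows(num_layers=4, min_window=2, max_window=8)
masks = [local_attention_mask(num_frames=20, window=w) for w in windows]
```

In a real model, each mask would be applied to the attention scores of the corresponding layer before the softmax, so that positions marked False receive no attention weight.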

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Test