WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Sanyuan Chen; Chengyi Wang; Zhengyang Chen; Yu Wu; Shujie Liu; Zhuo Chen; Jinyu Li; Naoyuki Kanda; Takuya Yoshioka; Xiong Xiao; Jian Wu; Long Zhou; Shuo Ren; Yanmin Qian; Yao Qian; Jian Wu; Michael Zeng; Xiangzhan Yu; Furu Wei

arXiv:2110.13900·cs.CL·November 23, 2022

TL;DR

WavLM is a large-scale self-supervised model pre-trained on 94k hours of speech data. It targets a wide range of speech processing tasks by jointly learning masked speech prediction and denoising.

Contribution

The paper introduces WavLM, a pre-trained model for full-stack speech processing that combines masked speech prediction, speech denoising, and a gated relative position bias in the Transformer.

Findings

01 WavLM Large achieves state-of-the-art results on the SUPERB benchmark.

02 WavLM significantly improves performance across various speech processing tasks.

03 Scaling the pre-training data up to 94k hours enhances model effectiveness.

Abstract

Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising in pre-training. By this means, WavLM does not only keep the speech content modeling capability by the masked speech prediction, but also improves the potential to non-ASR tasks by the speech denoising. In addition, WavLM employs gated relative position bias for the Transformer structure to better capture the sequence ordering of input speech. We also scale up the training…
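The joint objective described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: it assumes the input is corrupted by overlaying a second signal on part of the utterance, that frames are then masked, and that the loss is a cross-entropy over masked frames against pseudo-labels derived from the clean utterance. The function names (`mix_utterance`, `sample_frame_mask`, `masked_prediction_loss`) and all hyperparameter values are hypothetical placeholders.

```python
import numpy as np


def mix_utterance(primary, secondary, rng, max_ratio=0.5, ratio_db_range=(-5.0, 5.0)):
    """Overlay a scaled chunk of `secondary` on a random region of `primary`.

    Assumption: the overlap covers at most `max_ratio` of the primary signal,
    and the energy ratio (in dB) is sampled uniformly from `ratio_db_range`.
    """
    n = len(primary)
    mix_len = int(rng.integers(1, int(n * max_ratio) + 1))
    mix_len = min(mix_len, len(secondary))
    src_start = int(rng.integers(0, len(secondary) - mix_len + 1))
    dst_start = int(rng.integers(0, n - mix_len + 1))
    chunk = secondary[src_start:src_start + mix_len]

    # Scale the chunk so that primary energy / chunk energy matches ratio_db.
    ratio_db = rng.uniform(*ratio_db_range)
    p_energy = np.mean(primary ** 2) + 1e-8
    c_energy = np.mean(chunk ** 2) + 1e-8
    scale = np.sqrt(p_energy / (c_energy * 10 ** (ratio_db / 10)))

    noisy = primary.copy()
    noisy[dst_start:dst_start + mix_len] += scale * chunk
    return noisy


def sample_frame_mask(num_frames, rng, mask_prob=0.08, span=10):
    """Pick random start frames and mask a fixed-length span after each one."""
    mask = np.zeros(num_frames, dtype=bool)
    for start in np.flatnonzero(rng.random(num_frames) < mask_prob):
        mask[start:start + span] = True
    return mask


def masked_prediction_loss(logits, clean_targets, mask):
    """Cross-entropy on masked frames only.

    `clean_targets` are per-frame pseudo-labels of the CLEAN utterance, so a
    model fed the noisy input has to denoise implicitly to predict them.
    """
    if not mask.any():
        return 0.0
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(clean_targets)), clean_targets]
    return float(-picked[mask].mean())


rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in for one second of 16 kHz audio
other = rng.standard_normal(12000)   # stand-in for an interfering utterance
noisy = mix_utterance(clean, other, rng)
mask = sample_frame_mask(50, rng)
```

In a real pipeline the logits would come from a Transformer encoder run on `noisy` with the masked frames replaced by a mask embedding; here the pieces are kept separate so each step can be inspected on its own.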

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Adam · Dropout