WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei

TL;DR
WavLM is a large-scale self-supervised model trained on 94k hours of speech data, designed to improve a wide range of speech processing tasks by jointly learning speech prediction and denoising.
Contribution
The paper introduces WavLM, a pre-trained model for full-stack speech processing that combines masked speech prediction with speech denoising and adds a gated relative position bias to the Transformer.
Findings
WavLM Large achieves state-of-the-art results on the SUPERB benchmark.
WavLM significantly improves performance across various speech tasks.
Scaling the pre-training data up to 94k hours of speech enhances model effectiveness.
Abstract
Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising in pre-training. By this means, WavLM does not only keep the speech content modeling capability by the masked speech prediction, but also improves the potential to non-ASR tasks by the speech denoising. In addition, WavLM employs gated relative position bias for the Transformer structure to better capture the sequence ordering of input speech. We also scale up the training…
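The joint objective described in the abstract can be illustrated with a minimal data-simulation sketch. It is not the authors' implementation: the function names, the mixing ratio range, the energy-ratio range, and the masking parameters below are all illustrative assumptions. The idea shown is that the model input is a primary utterance with a chunk of a second utterance overlaid on it, a span mask selects frames to predict, and the prediction targets would still be derived from the clean primary utterance.

```python
import math
import random


def mix_utterance(primary, secondary, rng, max_ratio=0.5, db_range=(-5.0, 5.0)):
    """Overlay a chunk of `secondary` on a random region of `primary`.

    All parameter values are illustrative assumptions, not the paper's settings.
    Returns the mixed waveform and the (start, end) of the overlapped region.
    """
    if not primary or not secondary:
        raise ValueError("both utterances must be non-empty")
    n = len(primary)
    # Pick an overlap length of at most max_ratio of the primary utterance.
    mix_len = rng.randint(1, max(1, int(n * max_ratio)))
    mix_len = min(mix_len, len(secondary))
    start_p = rng.randint(0, n - mix_len)
    start_s = rng.randint(0, len(secondary) - mix_len)
    chunk = secondary[start_s:start_s + mix_len]

    # Scale the chunk so the primary-to-interference energy ratio matches a
    # randomly drawn value in decibels.
    energy_p = sum(x * x for x in primary) / n
    energy_c = sum(x * x for x in chunk) / mix_len
    ratio_db = rng.uniform(*db_range)
    if energy_p > 0 and energy_c > 0:
        scale = math.sqrt(energy_p / (10 ** (ratio_db / 10) * energy_c))
    else:
        scale = 0.0

    mixed = list(primary)
    for i, x in enumerate(chunk):
        mixed[start_p + i] += scale * x
    return mixed, (start_p, start_p + mix_len)


def span_mask(num_frames, rng, start_prob=0.08, span=10):
    """Mark frames to be predicted: each frame starts a masked span with
    probability `start_prob` (illustrative value); spans may overlap."""
    mask = [False] * num_frames
    for t in range(num_frames):
        if rng.random() < start_prob:
            for j in range(t, min(t + span, num_frames)):
                mask[j] = True
    return mask


rng = random.Random(0)
clean = [math.sin(0.01 * i) for i in range(1600)]        # stand-in primary utterance
interferer = [math.sin(0.03 * i) for i in range(800)]    # stand-in second utterance
noisy, (lo, hi) = mix_utterance(clean, interferer, rng)
mask = span_mask(100, rng)
# In training, `noisy` would be the model input, `mask` would select the frames
# to predict, and the targets would come from `clean`, not from `noisy`.
```

Keeping the targets tied to the clean utterance is what makes the masked prediction task double as a denoising task in this sketch: the model has to recover the primary speaker's content even where another signal overlaps it.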
Code & Models
- 🤗 nvidia/personaplex-7b-v1 (model) · 324k downloads · ♡ 2332
- 🤗 microsoft/wavlm-base-plus (model) · 552k downloads · ♡ 36
- 🤗 microsoft/wavlm-large (model) · 351k downloads · ♡ 102
- 🤗 microsoft/wavlm-base-plus-sd (model) · 124k downloads · ♡ 12
- 🤗 microsoft/wavlm-base-plus-sv (model) · 173k downloads · ♡ 54
- 🤗 microsoft/wavlm-base-sd (model) · 13 downloads
- 🤗 microsoft/wavlm-base-sv (model) · 31k downloads · ♡ 10
- 🤗 microsoft/wavlm-base (model) · 53k downloads · ♡ 11
- 🤗 D4ve-R/wavlm-base-plus-sv (model) · 3 downloads · ♡ 2
- 🤗 qualcomm/HuggingFace-WavLM-Base-Plus (model) · 38 downloads · ♡ 5
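As a rough sketch of how one of the checkpoints listed above might be used, the snippet below extracts frame-level representations with the Hugging Face transformers library. It assumes transformers and torch are installed, that the input is 16 kHz mono audio, and that the checkpoint ships a feature-extractor config; the helper names extract_features and expected_frames are hypothetical. The kernel and stride values in expected_frames are assumed to match the default convolutional front end in the library's WavLM configuration, so check them against the checkpoint's config.json before relying on them.

```python
def expected_frames(num_samples,
                    kernels=(10, 3, 3, 3, 3, 2, 2),
                    strides=(5, 2, 2, 2, 2, 2, 2)):
    """Estimate how many output frames the convolutional front end yields.

    The kernel/stride defaults are an assumption about the checkpoint's
    configuration; each unpadded conv layer maps L to (L - k) // s + 1.
    """
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length


def extract_features(waveform, model_id="microsoft/wavlm-base-plus"):
    """Return the last hidden states for a 1-D sequence of 16 kHz samples.

    Imports are local so that the helper above works without the heavy
    dependencies. Calling this downloads the checkpoint on first use.
    """
    import torch
    from transformers import Wav2Vec2FeatureExtractor, WavLMModel

    extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
    model = WavLMModel.from_pretrained(model_id)
    model.eval()

    inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (batch, frames, hidden_size)
    return outputs.last_hidden_state


# One second of 16 kHz audio should give roughly this many frames:
frames_per_second = expected_frames(16000)
```

A typical call would be extract_features(samples), where samples is a list or NumPy array of floats; the second dimension of the result can be compared against expected_frames(len(samples)) as a sanity check.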
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Adam · Dropout
