TL;DR
HuBERT is a self-supervised speech representation learning method that predicts the cluster assignments ("hidden units") of masked frames. This forces the model to learn a combined acoustic and language model, and it matches or improves on state-of-the-art results on Librispeech benchmarks.
Contribution
The paper presents HuBERT, which uses an offline clustering step to produce aligned target labels and a BERT-like prediction loss applied to masked regions only. The resulting representations match or surpass those of earlier methods such as wav2vec 2.0.
Findings
HuBERT matches or outperforms wav2vec 2.0 on Librispeech benchmarks.
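Targets from an offline clustering step work even when the labels are noisy: the method relies on the consistency of the cluster assignments rather than on their intrinsic quality.
With a 1B-parameter model, HuBERT achieves up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.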
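The paper's abstract reports the 19% figure as a relative WER reduction, which is easy to confuse with an absolute drop in percentage points. The small sketch below shows the difference; the function name and the input numbers are illustrative assumptions, not values from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Made-up WER values (in %), chosen only to illustrate the arithmetic.
baseline, improved = 10.0, 8.1
print(f"absolute reduction: {baseline - improved:.1f} points")
print(f"relative reduction: {relative_wer_reduction(baseline, improved):.1f}%")
```

The same relative reduction corresponds to a smaller absolute drop when the baseline WER is already low.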
Abstract
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels.…
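The two ingredients named in the abstract, offline clustering to obtain frame-level targets and a prediction loss applied to masked frames only, can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the frame features and the logits are random stand-ins for real acoustic features and Transformer outputs, and the helper names (`kmeans_labels`, `sample_span_mask`, `masked_prediction_loss`) and all sizes are hypothetical.

```python
import numpy as np


def kmeans_labels(features, num_clusters, num_iters=10, seed=0):
    """Offline clustering step: plain Lloyd's k-means, one cluster id per frame."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen frames (fancy indexing copies).
    centroids = features[rng.choice(len(features), num_clusters, replace=False)]
    for _ in range(num_iters):
        # Squared distance from every frame to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for k in range(num_clusters):
            members = features[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    # Labels come from the last assignment step of the loop.
    return labels


def sample_span_mask(num_frames, span_len, num_spans, seed=0):
    """Mark `num_spans` contiguous spans of `span_len` frames as masked."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(num_frames, dtype=bool)
    starts = rng.choice(num_frames - span_len + 1, num_spans, replace=False)
    for s in starts:
        mask[s:s + span_len] = True  # spans may overlap
    return mask


def masked_prediction_loss(logits, targets, mask):
    """Mean cross-entropy over masked frames only; unmasked frames are ignored."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    frame_nll = -log_probs[np.arange(len(targets)), targets]
    return frame_nll[mask].mean()


rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))       # stand-in for per-frame features
targets = kmeans_labels(frames, num_clusters=8)
mask = sample_span_mask(len(frames), span_len=10, num_spans=4)
logits = rng.normal(size=(200, 8))        # stand-in for model outputs
print("masked frames:", int(mask.sum()))
print("loss:", float(masked_prediction_loss(logits, targets, mask)))
```

Because the loss only sees masked positions, the model cannot lower it by copying the visible input at those frames; it has to infer the hidden unit from surrounding context.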
Code & Models
- 🤗 facebook/hubert-base-ls960 · model · 347k dl · ♡ 74
- 🤗 espnet/xeus · model · 33 dl · ♡ 146
- 🤗 facebook/hubert-large-ll60k · model · 25k dl · ♡ 35
- 🤗 facebook/hubert-large-ls960-ft · model · 133k dl · ♡ 76
- 🤗 facebook/hubert-xlarge-ll60k · model · 224 dl · ♡ 6
- 🤗 facebook/hubert-xlarge-ls960-ft · model · 830 dl · ♡ 16
- 🤗 omarxadel/hubert-large-arabic-egyptian · model · 39 dl · ♡ 19
- 🤗 marcoyang/icefall-asr-librispeech-finetune-hubert-transducer-2022-12-26 · model · ♡ 2
- 🤗 team-lucid/hubert-base-korean · model · 814 dl · ♡ 31
- 🤗 team-lucid/hubert-large-korean · model · 36 dl · ♡ 11
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Linear Warmup With Linear Decay · Residual Connection · WordPiece · Attention Dropout · Dense Connections
