CCC-wav2vec 2.0: Clustering aided Cross Contrastive Self-supervised learning of speech representations
Vasista Sai Lodagala, Sreyan Ghosh, S. Umesh

TL;DR
The paper introduces ccc-wav2vec 2.0, a self-supervised pre-training method for speech representations that combines a clustering module with an augmentation-based cross-contrastive loss to improve robustness and accuracy. It reports relative WER reductions over the baseline wav2vec 2.0 of up to 15.6% on LibriSpeech test-clean and 12.7% on LibriSpeech test-other, along with gains on Switchboard.
Contribution
It proposes a new pre-training strategy that combines clustering, used to down-weight negative examples that closely resemble the positive, with an augmentation-based cross-contrastive loss, improving speech representation learning over the baseline wav2vec 2.0.
Findings
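Up to 15.6% relative WER reduction over baseline wav2vec 2.0 on LibriSpeech test-clean, without a language model
Up to 12.7% relative WER reduction over baseline wav2vec 2.0 on LibriSpeech test-other, without a language model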
Up to 14.9% WER reduction on Switchboard
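The abstract describes these figures as relative WER improvements over the baseline, so a short sketch of that arithmetic may help. The function name and the example numbers below are illustrative assumptions, not values reported in the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Percent reduction of new_wer relative to baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values chosen only to illustrate the formula;
# they are NOT results from the paper.
example = relative_wer_reduction(baseline_wer=10.0, new_wer=8.5)
print(f"relative WER reduction: {example:.1f}%")
```

A relative figure depends on the baseline: the same absolute drop in WER yields a larger relative reduction when the baseline WER is already low.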
Abstract
While Self-Supervised Learning has helped reap the benefit of the scale from the available unlabeled data, the learning paradigms are continuously being bettered. We present a new pre-training strategy named ccc-wav2vec 2.0, which uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective. Through the clustering module, we scale down the influence of those negative examples that are highly similar to the positive. The Cross-Contrastive loss is computed between the encoder output of the original sample and the quantizer output of its augmentation and vice-versa, bringing robustness to the pre-training strategy. ccc-wav2vec 2.0 achieves up to 15.6% and 12.7% relative WER improvement over the baseline wav2vec 2.0 on the test-clean and test-other sets, respectively, of LibriSpeech, without the use of any language model. The proposed method also…
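The objective described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under several assumptions, not the authors' implementation: the helper names (`kmeans_ids`, `contrastive_loss`, `ccc_loss`), the toy k-means, the choice to apply the scaling factor to the exponentiated similarity, the use of all other time steps as negatives, and the values of `temperature`, `neg_scale`, and the loss weights are all hypothetical.

```python
import numpy as np


def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


def kmeans_ids(x, k, iters=10):
    """Toy k-means returning a cluster id per row (deterministic init)."""
    centers = x[:k].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        ids = dists.argmin(axis=1)
        for c in range(k):
            if np.any(ids == c):
                centers[c] = x[ids == c].mean(axis=0)
    return ids


def contrastive_loss(context, targets, cluster_ids,
                     temperature=0.1, neg_scale=0.3):
    """Contrastive loss where context[t] should match targets[t].

    All other targets act as negatives. Negatives that fall in the
    same cluster as the positive are down-weighted by neg_scale
    (assumed form of the scaling).
    """
    sims = cosine_sim(context, targets) / temperature      # shape (T, T)
    same_cluster = cluster_ids[:, None] == cluster_ids[None, :]
    weights = np.where(same_cluster, neg_scale, 1.0)
    np.fill_diagonal(weights, 1.0)                          # positive keeps full weight
    row_max = sims.max(axis=1, keepdims=True)               # for numerical stability
    denom = (weights * np.exp(sims - row_max)).sum(axis=1)
    log_prob_pos = (np.diag(sims) - row_max[:, 0]) - np.log(denom)
    return -log_prob_pos.mean()


def ccc_loss(c_orig, q_orig, c_aug, q_aug, k=4, w=(1.0, 0.5, 0.5)):
    """Standard loss plus the two cross terms (weights w are assumptions)."""
    ids_orig = kmeans_ids(q_orig, k)
    ids_aug = kmeans_ids(q_aug, k)
    standard = contrastive_loss(c_orig, q_orig, ids_orig)
    cross = contrastive_loss(c_orig, q_aug, ids_aug)        # original encoder vs augmented quantizer
    cross_rev = contrastive_loss(c_aug, q_orig, ids_orig)   # augmented encoder vs original quantizer
    return w[0] * standard + w[1] * cross + w[2] * cross_rev


# Random stand-ins for encoder (c) and quantizer (q) outputs: 20 steps, 16 dims.
rng = np.random.default_rng(0)
c_o, q_o = rng.normal(size=(20, 16)), rng.normal(size=(20, 16))
c_a, q_a = c_o + 0.1 * rng.normal(size=(20, 16)), q_o + 0.1 * rng.normal(size=(20, 16))
print("ccc loss:", ccc_loss(c_o, q_o, c_a, q_a))
```

With `neg_scale` below 1, negatives that share a cluster with the positive contribute less to the denominator, so the model is penalized less for failing to separate near-duplicates of the target. The two cross terms tie the representation of a sample to that of its augmentation.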
Code & Models
- 🤗 vasista22/wav2vec2-360h-base · model · 4 dl
- 🤗 vasista22/wav2vec2-360h-base-ft-100h · model · 2 dl
- 🤗 vasista22/ccc-wav2vec2-360h-base · model · 3 dl
- 🤗 vasista22/ccc-wav2vec2-360h-base-ft-100h · model · 2 dl
- 🤗 vasista22/ccc-wav2vec2-base · model · 2 dl · ♡ 1
- 🤗 vasista22/ccc-wav2vec2-base-100h · model · 1 dl · ♡ 1
- 🤗 vasista22/ccc-wav2vec2-base-SUPERB · model · 4 dl · ♡ 1
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
