k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning
Yifan Yang, Jianheng Zhuo, Zengrui Jin, Ziyang Ma, Xiaoyu Yang, Zengwei Yao, Liyong Guo, Wei Kang, Fangjun Kuang, Long Lin, Daniel Povey, Xie Chen

TL;DR
k2SSL introduces an efficient, open-source framework for self-supervised speech representation learning that leverages Zipformer architectures, significantly reducing training time and memory while improving downstream speech recognition performance.
Contribution
The paper presents k2SSL, a novel framework that integrates Zipformer models into SSL, achieving faster training, lower memory usage, and superior ASR results compared to existing methods.
Findings
Zipformer-based SSL achieves lower word error rate (WER) than HuBERT and WavLM.
Significant 3.5x pre-training speedup with Zipformer Base.
Efficient scaling to 60k hours of data with comparable performance.
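The findings above are stated in terms of WER and relative improvement. As a rough illustration of how such figures are usually computed, here is a minimal Python sketch. It is not code from k2SSL: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and the example sentences and numbers are made up for illustration, not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative inputs only; these are not numbers from the paper.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")
    print(f"Relative reduction: {relative_wer_reduction(10.0, 7.5):.1%}")
```

In this sketch, a system that lowers WER from 10.0 to 7.5 has a relative WER reduction of 25%, which is the sense in which "WER reduction" is typically reported when comparing SSL backbones.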
Abstract
Self-supervised learning (SSL) has achieved great success in speech-related tasks. While Transformer and Conformer architectures have dominated SSL backbones, encoders like Zipformer, which excel in automatic speech recognition (ASR), remain unexplored in SSL. Concurrently, inefficiencies in data processing within existing SSL training frameworks, such as fairseq, pose challenges in managing the growing volumes of training data. To address these issues, we propose k2SSL, an open-source framework that offers faster, more memory-efficient, and better-performing self-supervised speech representation learning, focusing on downstream ASR tasks. The optimized HuBERT and proposed Zipformer-based SSL systems exhibit substantial reductions in both training time and memory usage during SSL training. Experiments on LibriSpeech demonstrate that Zipformer Base significantly outperforms HuBERT and…
Code & Models
- 🤗 reazon-research/japanese-hubert-base-k2 · model · 292 dl · ♡ 1
- 🤗 reazon-research/japanese-hubert-base-k2-rs35kh · model · 11 dl · ♡ 1
- 🤗 reazon-research/japanese-hubert-base-k2-rs35kh-bpe · model · 71 dl · ♡ 4
- 🤗 reazon-research/japanese-zipformer-base-k2-rs35kh-bpe · model · 9 dl
- 🤗 reazon-research/japanese-zipformer-base-k2 · model · 5 dl
- 🤗 reazon-research/japanese-zipformer-base-k2-rs35kh · model · 12 dl
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Attention Is All You Need · Residual Connection · Softmax · Adam · Label Smoothing · Dropout · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding
