k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning

Yifan Yang; Jianheng Zhuo; Zengrui Jin; Ziyang Ma; Xiaoyu Yang; Zengwei Yao; Liyong Guo; Wei Kang; Fangjun Kuang; Long Lin; Daniel Povey; Xie Chen

arXiv:2411.17100·eess.AS·March 25, 2025·ICME

TL;DR

k2SSL introduces an efficient, open-source framework for self-supervised speech representation learning that leverages Zipformer architectures, significantly reducing training time and memory while improving downstream speech recognition performance.

Contribution

The paper presents k2SSL, a novel framework that integrates Zipformer models into SSL, achieving faster training, lower memory usage, and superior ASR results compared to existing methods.

Findings

01

Zipformer-based SSL outperforms HuBERT and WavLM in WER reduction.

02

Zipformer Base achieves a 3.5x pre-training speedup.

03

Efficient scaling to 60k hours of data with comparable performance.
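
The first two findings are stated as relative metrics. A minimal sketch of how such figures are conventionally computed follows, assuming the usual definitions of relative WER reduction and of speedup as a ratio of training cost. The function names and all input numbers are hypothetical placeholders, not values or code from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction by which new_wer improves on baseline_wer.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def speedup(baseline_gpu_hours: float, new_gpu_hours: float) -> float:
    """Return how many times cheaper the new system is to train."""
    if new_gpu_hours <= 0:
        raise ValueError("new_gpu_hours must be positive")
    return baseline_gpu_hours / new_gpu_hours


if __name__ == "__main__":
    # Hypothetical inputs chosen only to illustrate the arithmetic.
    print(relative_wer_reduction(10.0, 7.5))  # 0.25, i.e. a 25% relative reduction
    print(speedup(700.0, 200.0))              # 3.5, i.e. a 3.5x speedup
```

Under these definitions, a 3.5x speedup means the baseline consumes 3.5 times the training cost of the new system for the same pre-training run.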

Abstract

Self-supervised learning (SSL) has achieved great success in speech-related tasks. While Transformer and Conformer architectures have dominated SSL backbones, encoders like Zipformer, which excel in automatic speech recognition (ASR), remain unexplored in SSL. Concurrently, inefficiencies in data processing within existing SSL training frameworks, such as fairseq, pose challenges in managing the growing volumes of training data. To address these issues, we propose k2SSL, an open-source framework that offers faster, more memory-efficient, and better-performing self-supervised speech representation learning, focusing on downstream ASR tasks. The optimized HuBERT and proposed Zipformer-based SSL systems exhibit substantial reductions in both training time and memory usage during SSL training. Experiments on LibriSpeech demonstrate that Zipformer Base significantly outperforms HuBERT and…

Code & Models

Repositories

k2-fsa/icefall
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Attention Is All You Need · Residual Connection · Softmax · Adam · Label Smoothing · Dropout · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding