Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications
Varun Krishna, Tarun Sai, Sriram Ganapathy

TL;DR
This paper introduces a self-supervised speech representation learning method using hidden unit clustering, achieving state-of-the-art results in low-resource speech tasks without relying on textual data.
Contribution
The paper presents a novel hidden unit clustering framework for self-supervised speech representation learning, improving speaker invariance and performance on low-resource speech applications.
Findings
Achieves state-of-the-art results on ZeroSpeech 2021 tasks.
Significantly outperforms baselines such as Wav2vec and HuBERT on semi-supervised ASR.
Effective in both unsupervised and semi-supervised speech recognition settings.
Abstract
The representation learning of speech, without textual resources, is an area of significant interest for many low-resource speech applications. In this paper, we describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework. The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers. The learned "time-frequency" representations from the convolutional neural network (CNN) module are further processed with long short-term memory (LSTM) layers, which generate a contextual vector representation for every windowed segment. The HUC framework, allowing the categorization of the representations into a small number of phoneme-like units, is used to train the model for learning semantically rich speech representations. The targets consist of phoneme-like pseudo labels for each audio…
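The pseudo-labeling idea in the abstract, grouping per-frame vectors into a small number of discrete units and using the unit index as a training target, can be illustrated with a minimal sketch. The excerpt above does not say which clustering algorithm the authors use, so plain k-means with a deterministic farthest-first initialization is assumed here purely for illustration. The toy 2-D `frames` stand in for the contextual vectors produced by the CNN and LSTM layers, and every name in the block (`kmeans_pseudo_labels`, `num_units`, and so on) is hypothetical rather than taken from the paper.

```python
# Illustrative sketch only: k-means is an ASSUMED stand-in for the paper's
# hidden unit clustering step; all names here are hypothetical.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the centroid closest to vec (first index wins ties)."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(vec, centroids[k]))


def farthest_first_init(frames, num_units):
    """Deterministic init: start at frame 0, then repeatedly add the frame
    that is farthest from all centroids chosen so far."""
    centroids = [list(frames[0])]
    while len(centroids) < num_units:
        best = max(frames,
                   key=lambda v: min(squared_distance(v, c) for c in centroids))
        centroids.append(list(best))
    return centroids


def kmeans_pseudo_labels(frames, num_units, num_iters=20):
    """Cluster frame vectors and return (labels, centroids).

    labels[i] is the discrete unit index assigned to frames[i]; in a
    self-supervised setup such indices could serve as classification targets.
    """
    centroids = farthest_first_init(frames, num_units)
    for _ in range(num_iters):
        labels = [nearest(v, centroids) for v in frames]
        for k in range(num_units):
            members = [v for v, lab in zip(frames, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                dim = len(members[0])
                centroids[k] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    labels = [nearest(v, centroids) for v in frames]
    return labels, centroids


# Toy "frame representations": two well-separated groups of 2-D vectors.
frames = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
          [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
labels, centroids = kmeans_pseudo_labels(frames, num_units=2)
print(labels)
```

On this toy input the first three frames share one unit index and the last three share the other. In a real system the vectors would be high-dimensional and the number of units much larger; the sketch only shows how discrete targets can be derived from continuous representations without any text.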
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
