NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks
He Huang, Taejin Park, Kunal Dhawan, Ivan Medennikov, Krishna C. Puvvada, Nithin Rao Koluguri, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg

TL;DR
NEST introduces a fast, efficient self-supervised speech model using FastConformer and random projection, achieving state-of-the-art results across multiple speech tasks with reduced computational cost.
Contribution
NEST presents a simplified self-supervised framework built on the faster FastConformer architecture and a generalized noisy speech augmentation, outperforming existing self-supervised models on a range of speech processing tasks.
Findings
Achieves new state-of-the-art performance on speech recognition and translation.
Demonstrates improved speaker diarization and spoken language understanding.
Offers a computationally efficient alternative to traditional self-supervised models.
Abstract
Self-supervised learning has been proven to benefit a wide range of speech processing tasks, such as speech recognition and translation, speaker verification, and diarization. However, most current approaches are computationally expensive. In this paper, we propose a simplified and more efficient self-supervised learning framework termed NeMo Encoder for Speech Tasks (NEST). Specifically, we adopt the FastConformer architecture with an 8x sub-sampling rate, which is faster than Transformer or Conformer architectures. Instead of clustering-based quantization, we use fixed random projection for its simplicity and effectiveness. We also implement a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that NEST improves over existing self-supervised models and achieves new state-of-the-art…
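To make the "fixed random projection" idea in the abstract concrete, here is a minimal NumPy sketch of a random-projection quantizer that turns feature frames into discrete target IDs. It is an illustration under stated assumptions, not the NEST implementation: the function names, the feature dimension (80), the code dimension (16), the codebook size (8192), and the use of cosine similarity for the nearest-code lookup are all hypothetical choices made for the example.

```python
import numpy as np


def make_quantizer(feat_dim, code_dim, num_codes, seed=0):
    """Create a frozen random projection and a frozen random codebook.

    Neither matrix is trained; both are fixed once at initialization.
    All sizes here are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((feat_dim, code_dim))
    codebook = rng.standard_normal((num_codes, code_dim))
    # Unit-normalize the code vectors so a dot product equals cosine similarity.
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)
    return projection, codebook


def quantize(features, projection, codebook):
    """Map each frame (row of `features`) to the index of its nearest code."""
    projected = features @ projection
    projected /= np.linalg.norm(projected, axis=1, keepdims=True) + 1e-8
    # Nearest code under cosine similarity = largest dot product.
    return np.argmax(projected @ codebook.T, axis=1)


# Example: 100 frames of 80-dimensional features (random stand-in data).
projection, codebook = make_quantizer(feat_dim=80, code_dim=16, num_codes=8192)
frames = np.random.default_rng(1).standard_normal((100, 80))
targets = quantize(frames, projection, codebook)
```

Because the projection and codebook stay fixed, the targets are deterministic for a given input and require no clustering pass over the training data, which is the simplicity the abstract points to. In a masked-prediction setup, IDs like `targets` would serve as labels for the masked frames.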
Code & Models
- 🤗 nvidia/diar_sortformer_4spk-v1 (model) · 5.3k dl · ♡ 137
- 🤗 nvidia/diar_streaming_sortformer_4spk-v2 (model) · 23k dl · ♡ 111
- 🤗 nvidia/diar_streaming_sortformer_4spk-v2.1 (model) · 6.5k dl · ♡ 59
- 🤗 nvidia/ssl_en_nest_large_v1.0 (model) · 57 dl · ♡ 8
- 🤗 nvidia/ssl_en_nest_xlarge_v1.0 (model) · 156 dl · ♡ 7
- 🤗 nvidia/multitalker-parakeet-streaming-0.6b-v1 (model) · 525 dl · ♡ 94
- 🤗 sbintuitions/nest-ja-0.1b (model) · 17 dl · ♡ 3
- 🤗 sbintuitions/nest-ja-0.6b (model) · 9 dl · ♡ 7
- 🤗 everyscribe/diar_streaming_sortformer_4spk-v2 (model) · 3 dl
- 🤗 thoratsr7/multitalker-parakeet-streaming-0.6b-v1 (model)
Taxonomy
Topics: Speech Recognition and Synthesis
