NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks

He Huang; Taejin Park; Kunal Dhawan; Ivan Medennikov; Krishna C. Puvvada; Nithin Rao Koluguri; Weiqing Wang; Jagadeesh Balam; Boris Ginsburg

arXiv:2408.13106·cs.SD·January 22, 2025

TL;DR

NEST introduces a fast, efficient self-supervised speech model using FastConformer and random projection, achieving state-of-the-art results across multiple speech tasks with reduced computational cost.

Contribution

It presents a simplified self-supervised framework with a faster architecture and novel augmentation, outperforming existing models in various speech processing tasks.

Findings

01. Achieves new state-of-the-art performance on speech recognition and translation.

02. Demonstrates improved speaker diarization and spoken language understanding.

03. Offers a computationally efficient alternative to traditional self-supervised models.

Abstract

Self-supervised learning has been shown to benefit a wide range of speech processing tasks, such as speech recognition and translation, speaker verification, and diarization. However, most current approaches are computationally expensive. In this paper, we propose a simplified and more efficient self-supervised learning framework termed NeMo Encoder for Speech Tasks (NEST). Specifically, we adopt the FastConformer architecture with an 8x sub-sampling rate, which is faster than Transformer or Conformer architectures. Instead of clustering-based quantization, we use fixed random projection for its simplicity and effectiveness. We also implement a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that NEST improves over existing self-supervised models and achieves new state-of-the-art…
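The abstract's "fixed random projection" in place of clustering-based quantization can be illustrated with a minimal sketch in the style of random-projection quantizers such as BEST-RQ. This is not the NeMo implementation: the feature dimension, code dimension, codebook size, L2 normalization, and nearest-neighbor lookup are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import random


def _l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def make_quantizer(feat_dim, code_dim, num_codes, seed=0):
    """Draw a projection matrix and codebook once; both stay frozen (assumed setup)."""
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(code_dim)]
                  for _ in range(feat_dim)]
    codebook = [_l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
                for _ in range(num_codes)]
    return projection, codebook


def quantize(frames, projection, codebook):
    """Map each feature frame to the index of its nearest codebook entry."""
    code_dim = len(projection[0])
    targets = []
    for frame in frames:
        # Project the frame into the codebook space with the frozen matrix.
        projected = [sum(f * row[j] for f, row in zip(frame, projection))
                     for j in range(code_dim)]
        projected = _l2_normalize(projected)
        # For unit vectors, the largest dot product is the smallest distance.
        scores = [sum(p * c for p, c in zip(projected, code)) for code in codebook]
        targets.append(max(range(len(codebook)), key=scores.__getitem__))
    return targets


# Hypothetical sizes: 80-dim features, 16-dim codes, 32 codebook entries.
projection, codebook = make_quantizer(feat_dim=80, code_dim=16, num_codes=32)
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(80)] for _ in range(5)]
targets = quantize(frames, projection, codebook)
```

Because nothing in the quantizer is learned, the discrete targets are available before any training starts and need no separate clustering pass, which is the simplicity the abstract points to. In a masked-prediction setup the encoder would be trained to predict these indices for masked frames; that training loop is outside this sketch.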

Code & Models

Repositories

NVIDIA/NeMo (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis