GigaAM: Efficient Self-Supervised Learner for Speech Recognition

Aleksandr Kutsakov; Alexandr Maximenko; Georgii Gospodinov; Pavel Bogomolov; Fyodor Minkin

arXiv:2506.01192·eess.AS·June 3, 2025

TL;DR

GigaAM is an efficient self-supervised pretraining framework for speech recognition that combines masked language modeling with chunkwise attention. It yields a state-of-the-art Russian ASR model, and the models and inference code are released as open source.

Contribution

The paper presents an SSL pretraining method that uses masked language modeling with targets derived from a speech recognition model, together with chunkwise attention and dynamic chunk size sampling. The resulting GigaAM family of models includes a state-of-the-art Russian ASR system.

Findings

01. GigaAM outperforms Whisper-large-v3 by 50% on Russian speech recognition.

02. The experiments examine how the method scales with model size and the amount of data.

03. The foundation and ASR models, along with the inference code, are released under the MIT license.

Abstract

Self-Supervised Learning (SSL) has demonstrated strong performance in speech processing, particularly in automatic speech recognition. In this paper, we explore an SSL pretraining framework that leverages masked language modeling with targets derived from a speech recognition model. We also present chunkwise attention with dynamic chunk size sampling during pretraining to enable both full-context and streaming fine-tuning. Our experiments examine scaling with respect to model size and the amount of data. Using our method, we train the GigaAM family of models, including a state-of-the-art model for Russian speech recognition that outperforms Whisper-large-v3 by 50%. We have released our foundation and ASR models, along with the inference code, under the MIT license as open-source resources to the research community. Available at https://github.com/salute-developers/gigaam.
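
The chunkwise attention with dynamic chunk size sampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the candidate chunk sizes, the full-context probability, and the choice to let each frame attend to its own chunk plus all earlier chunks are assumptions made for illustration only.

```python
import random


def chunk_attention_mask(num_frames, chunk_size):
    """Build a boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Assumption (not taken from the paper): frames are split into consecutive
    chunks of `chunk_size`, and a frame sees its own chunk and every earlier
    chunk. Passing chunk_size=None gives full-context attention.
    """
    if chunk_size is None:
        return [[True] * num_frames for _ in range(num_frames)]
    mask = []
    for i in range(num_frames):
        # Exclusive end index of the chunk that contains frame i.
        visible_end = (i // chunk_size + 1) * chunk_size
        mask.append([j < visible_end for j in range(num_frames)])
    return mask


def sample_chunk_size(rng, choices=(4, 8, 16), full_context_prob=0.5):
    """Draw a chunk size for one training batch; None selects full context.

    The candidate sizes and the probability are hypothetical values.
    """
    if rng.random() < full_context_prob:
        return None
    return rng.choice(choices)


# Usage: sample a chunk size per batch, then build the mask for that batch.
rng = random.Random(0)
chunk_size = sample_chunk_size(rng)
mask = chunk_attention_mask(num_frames=32, chunk_size=chunk_size)
```

Sampling the chunk size anew for each batch exposes the encoder to both limited and unlimited context during pretraining, which is the property the abstract relies on to support full-context as well as streaming fine-tuning.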

Code & Models

Repositories

salute-developers/gigaam (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Softmax · Attention Is All You Need