Pisets: A Robust Speech Recognition System for Lectures and Interviews

Ivan Bondarenko; Daniil Grebenkin; Oleg Sedukhin; Mikhail Klementev; Roman Derunets; Lyudmila Budneva

arXiv:2601.18415 · cs.CL · January 27, 2026

TL;DR

Pisets is a speech recognition system that chains Wav2Vec2, an Audio Spectrogram Transformer filter, and Whisper to transcribe long, acoustically diverse Russian audio recordings more accurately than WhisperX and the standard Whisper model.

Contribution

The paper introduces a novel three-component architecture with curriculum learning and uncertainty modeling to enhance speech recognition accuracy and robustness.

Findings

01 Outperforms WhisperX and Whisper in accuracy and robustness

02 Effective in transcribing long audio across various acoustic conditions

03 Utilizes curriculum learning and uncertainty modeling for improvements

Abstract

This work presents "Pisets", a speech-to-text system for scientists and journalists. It is based on a three-component architecture aimed at improving speech recognition accuracy while minimizing the errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. Curriculum learning and the use of diverse Russian-language speech corpora significantly enhanced the system's effectiveness. Additionally, advanced uncertainty modeling techniques were introduced, contributing to further improvements in transcription quality. The proposed approaches yield more robust transcription of long audio data across various acoustic conditions than WhisperX and the standard Whisper model. The…
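The three stages named in the abstract can be pictured as a simple chain: a primary recognizer proposes candidate speech segments, a filter discards false positives, and a final recognizer transcribes what remains. The sketch below is only an illustration of that control flow, not the authors' implementation. The names `Segment`, `run_pipeline`, `toy_primary`, `toy_is_speech`, and `toy_final` are hypothetical, and the toy functions are trivial stand-ins for Wav2Vec2, AST, and Whisper respectively. How Pisets actually derives segment boundaries is not stated in the abstract, so the fixed-window split here is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Audio = Sequence[float]  # mono samples; a stand-in for a real waveform type


@dataclass
class Segment:
    start: int      # index of the first sample in the segment
    end: int        # index one past the last sample
    text: str = ""  # transcript, filled in by the final stage


def run_pipeline(
    audio: Audio,
    primary: Callable[[Audio], List[Segment]],
    is_speech: Callable[[Audio], bool],
    final: Callable[[Audio], str],
) -> List[Segment]:
    # Stage 1: the primary recognizer proposes candidate speech segments.
    candidates = primary(audio)
    # Stage 2: the filter drops candidates it judges to be non-speech.
    kept = [s for s in candidates if is_speech(audio[s.start:s.end])]
    # Stage 3: the final recognizer transcribes only the surviving segments.
    return [Segment(s.start, s.end, final(audio[s.start:s.end])) for s in kept]


# Toy stand-ins (assumptions for illustration, not real models).
def toy_primary(audio: Audio) -> List[Segment]:
    # Split the audio into fixed windows of 4 samples.
    return [Segment(i, min(i + 4, len(audio))) for i in range(0, len(audio), 4)]


def toy_is_speech(chunk: Audio) -> bool:
    # Treat a chunk as speech if any sample exceeds a small amplitude.
    return max(abs(x) for x in chunk) > 0.1


def toy_final(chunk: Audio) -> str:
    # Placeholder "transcript" that just reports the chunk length.
    return f"<{len(chunk)} samples>"


if __name__ == "__main__":
    audio = [0.0] * 4 + [0.5] * 4 + [0.0] * 2
    for seg in run_pipeline(audio, toy_primary, toy_is_speech, toy_final):
        print(seg)
```

Passing the stages in as callables keeps the chain independent of any particular model library, so a real system could substitute actual model wrappers without changing `run_pipeline`. Placing the filter before the final recognizer means segments rejected as non-speech never reach the last stage, which is one plausible reading of how the architecture limits hallucinations.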

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing