Exploration of End-to-End ASR for OpenSTT -- Russian Open Speech-to-Text Dataset
Andrei Andrusenko, Aleksandr Laptev, Ivan Medennikov

TL;DR
This paper evaluates various end-to-end ASR models on the large Russian OpenSTT dataset, comparing their performance to a hybrid system across different speech domains.
Contribution
It provides a comprehensive comparison of end-to-end ASR approaches with hybrid models on a large open-source Russian speech dataset.
Findings
End-to-end models achieve comparable WER to hybrid systems on validation sets.
Transformer-based models perform best among end-to-end approaches.
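The best end-to-end model beats the hybrid system on the YouTube (19.1% vs. 20.9% WER) and books (18.1% vs. 18.6% WER) validation sets, while the hybrid system stays ahead on phone calls (33.5% vs. 34.8% WER).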
Abstract
This paper presents an exploration of end-to-end automatic speech recognition (ASR) systems for the largest open-source Russian language data set -- OpenSTT. We evaluate different existing end-to-end approaches such as joint CTC/Attention, RNN-Transducer, and Transformer. All of them are compared with the strong hybrid ASR system based on an LF-MMI TDNN-F acoustic model. For the three available validation sets (phone calls, YouTube, and books), our best end-to-end model achieves a word error rate (WER) of 34.8%, 19.1%, and 18.1%, respectively. Under the same conditions, the hybrid ASR system demonstrates 33.5%, 20.9%, and 18.6% WER.
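For readers unfamiliar with the metric reported above, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the authors' scoring tooling or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a plain whitespace split, which is
    an assumption and not necessarily what the paper's scoring used.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.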
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding
