Exploration of End-to-End ASR for OpenSTT -- Russian Open Speech-to-Text Dataset

Andrei Andrusenko; Aleksandr Laptev; Ivan Medennikov

arXiv:2006.08274·eess.AS·October 8, 2020

TL;DR

This paper evaluates various end-to-end ASR models on the large Russian OpenSTT dataset, comparing their performance to a hybrid system across different speech domains.

Contribution

It provides a comprehensive comparison of end-to-end ASR approaches with hybrid models on a large open-source Russian speech dataset.

Findings

01

End-to-end models achieve WER comparable to the hybrid system on the three validation sets.

02

Transformer-based models perform best among end-to-end approaches.

03

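The best end-to-end model beats the hybrid system on the YouTube (19.1% vs. 20.9% WER) and books (18.1% vs. 18.6%) validation sets, while the hybrid system stays ahead on phone calls (33.5% vs. 34.8%).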

Abstract

This paper presents an exploration of end-to-end automatic speech recognition (ASR) systems for the largest open-source Russian language data set -- OpenSTT. We evaluate different existing end-to-end approaches such as joint CTC/Attention, RNN-Transducer, and Transformer. All of them are compared with a strong hybrid ASR system based on an LF-MMI TDNN-F acoustic model. For the three available validation sets (phone calls, YouTube, and books), our best end-to-end model achieves a word error rate (WER) of 34.8%, 19.1%, and 18.1%, respectively. Under the same conditions, the hybrid ASR system demonstrates 33.5%, 20.9%, and 18.6% WER.
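
Since every result above is reported as WER, a minimal sketch of how that metric is commonly computed may help: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring setup (text normalization, scoring tool) is not described in this summary and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    a real evaluation would usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical toy example (not taken from OpenSTT):
# one substitution plus one deletion over four reference words.
toy_wer = word_error_rate("a b c d", "a x c")
print(f"{toy_wer:.1%}")
```

Because insertions count as errors but the denominator is the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.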

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding