Open Source State-Of-the-Art Solution for Romanian Speech Recognition
Gabriel Pirlogeanu, Alexandru-Lucian Georgescu, Horia Cucu

TL;DR
This paper introduces a Romanian speech recognition system built on NVIDIA's FastConformer architecture and trained on over 2,600 hours of mostly weakly supervised speech. It reports state-of-the-art accuracy on all Romanian evaluation benchmarks, covering read, spontaneous, and domain-specific speech.
Contribution
The work is the first to apply FastConformer to Romanian ASR, achieving significant WER reduction and demonstrating practical decoding efficiency.
Findings
Achieved up to 27% relative WER reduction compared to previous best-performing systems.
Reached state-of-the-art performance on read, spontaneous, and domain-specific speech benchmarks.
Demonstrated practical decoding efficiency for low-latency applications.
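The headline figure above is a relative, not absolute, WER reduction. As a minimal sketch of how the two quantities are usually computed, the Python below implements word-level edit distance and the relative reduction formula. The function names and the WER values in the usage lines are illustrative assumptions, not code or results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Made-up numbers for illustration only (not taken from the paper):
# a drop from 8.0% to 6.0% WER is a 25% relative reduction.
example_wer = word_error_rate("a b c d", "a x c")
example_reduction = relative_wer_reduction(8.0, 6.0)
```

A relative figure depends on the baseline: the same 2-point absolute drop is a larger relative gain when the baseline WER is already low.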
Abstract
In this work, we present a new state-of-the-art Romanian Automatic Speech Recognition (ASR) system based on NVIDIA's FastConformer architecture, explored here for the first time in the context of Romanian. We train our model on a large corpus of mostly weakly supervised transcriptions, totaling over 2,600 hours of speech. Leveraging a hybrid decoder with both Connectionist Temporal Classification (CTC) and Token-Duration Transducer (TDT) branches, we evaluate a range of decoding strategies including greedy, ALSD, and CTC beam search with a 6-gram token-level language model. Our system achieves state-of-the-art performance across all Romanian evaluation benchmarks, including read, spontaneous, and domain-specific speech, with up to 27% relative WER reduction compared to previous best-performing systems. In addition to improved transcription accuracy, our approach demonstrates practical…
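Of the decoding strategies listed in the abstract, greedy decoding on the CTC branch is the simplest to illustrate. The sketch below shows the standard CTC greedy rule: take the best-scoring token per frame, collapse consecutive repeats, then drop blanks. It is a generic illustration, not the authors' implementation; the function name, the plain-list input format, and the choice of index 0 for the blank token are assumptions.

```python
from typing import List, Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    blank_id: int = 0,  # assumed blank index; real models may place it elsewhere
) -> List[int]:
    """Return token ids after per-frame argmax, repeat collapsing, and blank removal."""
    decoded: List[int] = []
    previous = None
    for scores in frame_scores:
        # Index of the highest-scoring token for this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit only when the token changes and is not the blank symbol.
        if best != previous and best != blank_id:
            decoded.append(best)
        previous = best
    return decoded


# Hypothetical scores over a 3-token vocabulary (0 = blank) for 7 frames.
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.0, 0.3, 0.7],
    [0.8, 0.1, 0.1],
]
token_ids = ctc_greedy_decode(frames)
```

The blank between the second and fourth frames is what lets the same token appear twice in a row in the output; without it the two occurrences would be collapsed into one.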
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
