Quality of Automatic Speech Recognition -- Polish Language case study -- from Wav2Vec to Scribe ElevenLabs

Marcin Pietroń; Szymon Piórkowski; Kamil Faber; Dominik Żurek; Michał Karwatowski; Jerzy Duda; Hubert Zieliński; Piotr Lipnicki; Mikołaj Leszczuk

arXiv:2603.02246·eess.AS·March 4, 2026

Open Access

TL;DR

This study compares state-of-the-art ASR models for Polish using the word error rate (WER) and character error rate (CER) metrics. The Whisper and Scribe models perform best, especially on medical interview transcription.
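As a reminder of what the two metrics measure, the sketch below computes WER and CER from the Levenshtein (edit) distance between a reference transcript and an ASR hypothesis. It is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, and text normalization (casing, punctuation), which strongly affects reported scores, is assumed to have been applied beforehand.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost for
    substitution, insertion, and deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only; not taken from the paper's datasets.
    ref = "pacjent ma kaszel"
    hyp = "pacjent ma katar"
    print(f"WER = {wer(ref, hyp):.3f}, CER = {cer(ref, hyp):.3f}")
```

Because both rates divide by the reference length, they can exceed 1.0 when the hypothesis contains many insertions.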

Contribution

The paper provides a comparative analysis of modern ASR architectures and introduces a pipeline that combines ASR with an LLM to improve the accuracy of medical transcription in Polish.
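The two-stage idea (ASR output passed to an LLM for correction) can be sketched as a small function that takes the two stages as callables. Everything named here is an assumption for illustration: `transcribe_and_correct`, the prompt wording, and the stub functions are hypothetical and do not reproduce the authors' models, prompts, or APIs.

```python
from typing import Callable

# Hypothetical stage signatures: an ASR stage maps an audio path to raw text,
# an LLM stage maps a prompt to a completion.
AsrFn = Callable[[str], str]
LlmFn = Callable[[str], str]

# Assumed prompt wording; the paper's actual prompt is not shown in this listing.
PROMPT_TEMPLATE = (
    "Correct recognition errors in the following Polish medical interview "
    "transcript. Return only the corrected text.\n\nTranscript:\n{transcript}"
)


def transcribe_and_correct(audio_path: str, asr: AsrFn, llm: LlmFn) -> str:
    """Run ASR, then ask the LLM to correct the raw transcript.

    Falls back to the raw ASR output if the LLM returns an empty string.
    """
    raw = asr(audio_path)
    corrected = llm(PROMPT_TEMPLATE.format(transcript=raw)).strip()
    return corrected or raw


# Stub stages so the sketch runs without any model; replace with real calls.
def fake_asr(audio_path: str) -> str:
    return "pacjent ma goraczke"


def fake_llm(prompt: str) -> str:
    return "Pacjent ma gorączkę."


if __name__ == "__main__":
    print(transcribe_and_correct("interview.wav", fake_asr, fake_llm))
```

Passing the stages as callables keeps the pipeline independent of any particular ASR or LLM vendor, which makes it easy to score the raw and corrected transcripts side by side.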

Findings

01

Whisper outperforms other open-source models in general benchmarks.

02

Scribe ElevenLabs achieves the best results on Polish medical data.

03

Models show varying robustness under different audio degradation conditions.
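This listing does not say which degradations were tested. As one common example of how such a robustness test can be set up, the sketch below adds white Gaussian noise to a signal at a chosen signal-to-noise ratio (SNR). The choice of additive white noise, the function names, and the sample values are all assumptions for illustration, not the paper's protocol.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return a copy of `signal` with white Gaussian noise added so that the
    expected signal-to-noise ratio equals `snr_db` (in decibels)."""
    if not signal:
        raise ValueError("signal must not be empty")
    rng = random.Random(seed)  # seeded for reproducible degradations
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its degraded version."""
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    if noise_power == 0:
        return math.inf
    signal_power = sum(c * c for c in clean) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    # One second of a 440 Hz tone at 16 kHz stands in for real speech here.
    clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
    for target in (20.0, 10.0, 0.0):
        noisy = add_noise_at_snr(clean, target)
        print(f"target {target:5.1f} dB, measured {measured_snr_db(clean, noisy):5.2f} dB")
```

Running each ASR model on the same utterances at several SNR levels and plotting WER against SNR is one way to compare robustness.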

Abstract

This article presents comparative studies of Automatic Speech Recognition (ASR) models combined with a Large Language Model (LLM) for transcribing medical interviews. The proposed solution is tested on Polish-language benchmarks and on a dataset of medical interviews. The latest ASR technologies are based on convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformers, and most of them work as end-to-end solutions. For the Whisper model, the presented approach is a two-stage pipeline in which an end-to-end ASR model and an LLM work together: the ASR output is the input to the LLM, which corrects and improves it. Comparative studies of automatic recognition of Polish were performed between modern end-to-end deep learning architectures and a hybrid ASR model. The medical interview tests were…

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · COVID-19 diagnosis using AI