# Benchmarking speech-to-text robustness in noisy emergency medical dialogues: an evaluation of models under realistic acoustic conditions

**Authors:** Denis Moser, Nikola Stanic, Murat Sariyar

PMC · DOI: 10.1093/jamiaopen/ooaf147 · 2025-11-19

## TL;DR

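This study benchmarks 6 German-capable speech-to-text systems on simulated emergency medical dialogues overlaid with realistic noise. recapp performed best overall, Whisper v3 Turbo had the lowest medical word error rate among the open-source models, and accuracy dropped most under dense crowd noise at the lowest signal-to-noise ratio.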

## Contribution

The study introduces a clinically relevant benchmark for evaluating speech-to-text systems in realistic emergency medical noise conditions.

## Key findings

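- recapp consistently outperformed all other systems across metrics.
- Among open-source models, Whisper v3 Turbo showed the lowest medical word error rate (mWER) and the best phrase-level accuracy (BLEU), while Whisper v3 Large preserved semantic content best.
- Dense environmental noise (the “inside crowded” condition) degraded transcription the most, “talking” noise had minimal effect, and degradation was most pronounced at the lowest SNR (–2 dB).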

## Abstract

This study aimed to evaluate the transcription accuracy of 6 German-capable speech-to-text (STT) systems in simulated emergency medical services (EMS) environments, focusing on clinically relevant performance under noisy and multilingual field conditions.

We generated a corpus of 99 synthetic emergency dialogues and overlaid them with ecologically valid noise types—crowd chatter, traffic, public spaces, and ambulance interiors—at 5 signal-to-noise ratios (SNRs), producing 1980 noisy audio samples. Each was transcribed by 6 STT systems (recapp, Vosk, Whisper v3 variants, and RescueSpeech). We assessed performance using 5 metrics: Word Error Rate (WER), Medical Word Error Rate (mWER), TF–IDF Cosine Similarity, BLEU, and semantic embedding similarity. Statistical models quantified the effects of system, noise, and SNR on transcription fidelity.
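The sample count is consistent with a fully crossed design: 99 dialogues × 4 noise types × 5 SNRs = 1980. As a rough illustration of what overlaying noise at a target SNR involves, the following is a minimal sketch and not the authors' pipeline. It assumes SNR is defined as the ratio of mean speech power to mean noise power over the whole clip, and that a short noise clip is looped to cover the utterance. The function names `mean_power` and `mix_at_snr` are hypothetical.

```python
import math
import random


def mean_power(samples):
    """Mean squared amplitude of a sequence of float samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the speech-to-noise power ratio is `snr_db`.

    Assumption: SNR is computed over the full clip, not per frame.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise clip if it is shorter than the utterance, then trim.
    repeats = math.ceil(len(speech) / len(noise))
    noise = (list(noise) * repeats)[: len(speech)]

    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise clip is silent")

    # SNR_dB = 10 * log10(P_speech / P_noise)  =>  P_noise = P_speech / 10^(SNR_dB / 10)
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals standing in for a dialogue recording and a noise recording.
random.seed(0)
toy_speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
toy_noise = [random.uniform(-1.0, 1.0) for _ in range(4000)]
toy_mixed = mix_at_snr(toy_speech, toy_noise, snr_db=-2.0)
```

At a negative SNR such as –2 dB, the scaled noise carries more power than the speech, which is why that condition is the hardest in the grid.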

recapp consistently outperformed all other systems across metrics. Among open-source models, Whisper v3 Turbo achieved the lowest mWER and strongest phrase-level accuracy (BLEU), while Whisper v3 Large preserved semantic content best. RescueSpeech and Vosk underperformed. “Inside crowded” noise degraded performance the most, while “talking” noise had minimal effect. Degradation was most pronounced at the lowest SNR (–2 dB).

STT model accuracy is highly sensitive to acoustic conditions. Clinically salient transcription errors (mWER) were most frequent under dense environmental noise. Whisper v3 Turbo balances accuracy and efficiency, suggesting strong potential for EMS applications.
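To make the distinction between WER and mWER concrete, here is a minimal sketch. WER is the standard word-level edit distance (substitutions, deletions, insertions) divided by the reference length. The abstract does not define mWER, so the version below is an assumption: it counts only substituted or deleted reference words that appear in a medical vocabulary, divided by the number of medical words in the reference. The names `align`, `wer`, and `medical_wer` and the toy vocabulary are hypothetical.

```python
def align(ref, hyp):
    """Levenshtein alignment of two word lists.

    Returns a list of (op, ref_word, hyp_word) with op in
    {"ok", "sub", "del", "ins"}; the missing side is None.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    errors = sum(1 for op, _, _ in align(ref, hyp) if op != "ok")
    return errors / len(ref)


def medical_wer(reference, hypothesis, medical_terms):
    """Assumed mWER: share of medical reference words that were substituted or deleted.

    Returns None when the reference contains no medical terms.
    """
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    terms = {t.lower() for t in medical_terms}
    total = sum(1 for w in ref if w in terms)
    if total == 0:
        return None
    missed = sum(1 for op, r, _ in align(ref, hyp)
                 if op in ("sub", "del") and r in terms)
    return missed / total


# Toy example: one misrecognised drug name in a five-word utterance.
toy_ref = "give adrenaline and oxygen now"
toy_hyp = "give adrenalin and oxygen now"
toy_terms = {"adrenaline", "oxygen"}
toy_wer = wer(toy_ref, toy_hyp)
toy_mwer = medical_wer(toy_ref, toy_hyp, toy_terms)
```

In the toy example a single error among five words gives a modest WER, but it hits one of only two medical terms, so the medical-term error rate is much higher. This is the kind of gap a domain-specific metric is meant to expose.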

This study introduces a clinically grounded, noise-robust benchmark for STT evaluation in EMS settings. It highlights the importance of domain-specific metrics and acoustic realism for deploying STT systems where transcription errors carry safety-critical consequences.

## Full-text entities

- **Genes:** F3 (coagulation factor III, tissue factor) [NCBI Gene 2152] {aka CD142, TF, TFA}
- **Diseases:** symptom (MESH:D012816), speech (MESH:D013064), cardiac arrest (MESH:D006323), ID (MESH:C537985), pain (MESH:D010146), critical (MESH:D016638), trauma (MESH:D014947), stroke (MESH:D020521)
- **Chemicals:** Salicylate (MESH:D012459), CH717305A1 (-), adrenaline (MESH:D004837), oxygen (MESH:D010100), Salbutamol (MESH:D000420)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12628192/full.md

---
Source: https://tomesphere.com/paper/PMC12628192