Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian
Yun Hao, Reihaneh Amooie, Wietse de Vries, Rik van Noord, Martijn Wieling

TL;DR
This study evaluates the effectiveness of large language models in correcting errors in low-resource Frisian ASR, demonstrating genuine improvements and analyzing correction patterns while controlling for data contamination.
Contribution
The paper provides the first comprehensive analysis of LLM-based generative error correction (GER) for low-resource ASR, including contamination control and a detailed error analysis.
Findings
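Generative error correction (GER) improves ASR performance for low-resource Frisian in most settings.
The best GPT-5.1 results surpass oracle WERs.
Gains on the offline dataset of non-public texts are comparable to those on the public corpus, indicating that the improvements reflect true correction ability rather than data contamination.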
Abstract
Automatic speech recognition (ASR) has improved substantially in recent years, yet performance remains limited for low-resource languages. Large language models (LLMs) have shown promise for improving ASR through generative error correction (GER), but their effectiveness in low-resource settings remains underexplored. In addition, it remains unclear to what extent data contamination influences the reported improvements in LLM-based GER. This study investigates LLM-based GER for low-resource Frisian. In addition to a public corpus, we construct and use a Frisian offline dataset with non-public texts for evaluation to control for potential data contamination. Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results surpassing oracle WERs. Comparable gains on the offline dataset indicate that improvements reflect true correction ability. We further…
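To make the "surpassing oracle WERs" claim concrete, the following is a minimal sketch of how word error rate (WER) and an N-best oracle WER can be computed. It assumes the oracle is defined as picking, per utterance, the hypothesis in the ASR N-best list that is closest to the reference; the abstract does not state which oracle definition the paper uses. The function names and the toy English sentences are illustrative only and are not taken from the paper or its data.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(refs: List[str], hyps: List[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def oracle_wer(refs: List[str], nbest_lists: List[List[str]]) -> float:
    """Assumed N-best oracle: per utterance, keep the best hypothesis in the list."""
    errors = sum(
        min(edit_distance(r.split(), h.split()) for h in nbest)
        for r, nbest in zip(refs, nbest_lists)
    )
    words = sum(len(r.split()) for r in refs)
    return errors / words


# Toy data, made up for illustration (not Frisian, not from the paper).
refs = ["the cat sat on the mat", "we walk home"]
nbest_lists = [
    ["the cat sat on a mat", "a cat sat on the mat"],  # no hypothesis is perfect
    ["we talk home", "we walk home"],                  # second hypothesis is perfect
]
top1 = [nbest[0] for nbest in nbest_lists]
# A hypothetical GER output that combines evidence across hypotheses.
corrected = ["the cat sat on the mat", "we walk home"]

wer_top1 = corpus_wer(refs, top1)
wer_oracle = oracle_wer(refs, nbest_lists)
wer_corrected = corpus_wer(refs, corrected)
```

In this toy case the corrected output scores below the N-best oracle because the first reference is not present in any single hypothesis but can be assembled from two of them. That is the sense in which a generative corrector can beat a selection-based oracle, under the assumed definition.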
