Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition

Zahra Rahmani (1), Hossein Sameti (1) ((1) Department of Computer Engineering, Sharif University of Technology)

arXiv:2512.17247 · cs.CL · December 22, 2025

TL;DR

This paper introduces Error Level Noise (ELN), a noise-aware embedding that improves the robustness of Persian speech recognition in noisy environments, reducing word error rate (WER) from 31.10% to 24.84%.

Contribution

It proposes a novel ELN embedding technique, combined with hypothesis aggregation and fine-tuning, to enhance LLM-based error correction for ASR in noisy, low-resource-language settings.

Findings

1. ELN-conditioned model reduces WER from 31.10% to 24.84% on noisy Persian speech.

2. ELN embeddings enable better noise uncertainty quantification and hypothesis reliability assessment.

3. The approach outperforms baseline models, demonstrating robustness in real-world noisy scenarios.
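To put the reported figures in context: WER is conventionally computed as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below assumes that standard definition and whitespace tokenization; the paper's exact text normalization is not stated on this page. The function names `wer` and `relative_reduction` are illustrative, not from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.

    Assumes whitespace tokenization (an assumption; the paper's
    normalization for Persian text is not given here).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline` (both in the same unit)."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    print(f"{wer('a b c d', 'a x c'):.2f}")
    print(f"{relative_reduction(31.10, 24.84):.1%}")
```

With the reported numbers, the drop from 31.10% to 24.84% is 6.26 absolute points, or roughly a 20% relative reduction.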

Abstract

Automatic Speech Recognition (ASR) systems suffer significant performance degradation in noisy environments, a challenge that is especially severe for low-resource languages such as Persian. Even state-of-the-art models such as Whisper struggle to maintain accuracy under varying signal-to-noise ratios (SNRs). This study presents a robust noise-sensitive ASR error correction framework that combines multiple hypotheses and noise-aware modeling. Using noisy Persian speech, we generate 5-best hypotheses from a modified Whisper-large decoder. Error Level Noise (ELN) is introduced as a representation that captures semantic- and token-level disagreement across hypotheses, quantifying the linguistic distortions caused by noise. ELN thus provides a direct measure of noise-induced uncertainty, enabling the LLM to reason about the reliability of each hypothesis during correction. Three models are…
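The idea of measuring token-level disagreement across an n-best list can be illustrated with a simple proxy. The sketch below is not the paper's ELN definition, which this page does not give; it assumes whitespace tokenization and uses the mean pairwise normalized edit distance between hypotheses as a hypothetical stand-in. The name `nbest_disagreement` is made up for illustration.

```python
from itertools import combinations


def token_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, y in enumerate(b, start=1):
            curr[j] = min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match or substitute
            )
        prev = curr
    return prev[-1]


def nbest_disagreement(hypotheses: list[str]) -> float:
    """Mean pairwise edit distance, normalized by the longer hypothesis.

    Hypothetical proxy for hypothesis disagreement: 0.0 means all
    hypotheses agree; values near 1.0 mean they share almost nothing.
    """
    tokenized = [h.split() for h in hypotheses]
    pairs = list(combinations(tokenized, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        longest = max(len(a), len(b))
        if longest:
            total += token_edit_distance(a, b) / longest
    return total / len(pairs)


if __name__ == "__main__":
    clean = ["a b c"] * 5
    noisy = ["a b c", "a b d", "a x c", "y b c", "a b"]
    print(nbest_disagreement(clean), nbest_disagreement(noisy))
```

Under this proxy, a clean utterance whose 5-best hypotheses are identical scores 0.0, while noisier audio that yields divergent hypotheses scores higher. A scalar like this could in principle be passed to a correction model as a reliability signal, though how the paper encodes and injects ELN is not described here.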

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques