Harf-Speech: A Clinically Aligned Framework for Arabic Phoneme-Level Speech Assessment

Asif Azad; MD Sadik Hossain Shanto; Mohammad Sadat Hossain; Bdour Alwuqaysi; Sabri Boughorbel; Yahya Bokhari; Abdulrhman Aljouie; Ayah Othman Sindi; Ehsan Hoque

arXiv:2604.06191·eess.AS·April 9, 2026

TL;DR

Harf-Speech is a modular system for Arabic phoneme-level pronunciation assessment. Its scores reach a Pearson correlation of 0.791 with mean expert scores, outperforming existing end-to-end assessment frameworks.

Contribution

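The paper introduces a clinically aligned, interpretable framework for Arabic pronunciation assessment. It combines an MSA phonetizer, a fine-tuned speech-to-phoneme model, and alignment-based phoneme scoring, and it is validated against the judgments of three certified speech-language pathologists.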

Findings

01

The best fine-tuned model, OmniASR-CTC-1B-v2, achieves an 8.92% phoneme error rate.

02

Harf-Speech scores agree with mean expert scores at a Pearson correlation of 0.791 and an ICC(2,1) of 0.659.

03

Harf-Speech outperforms existing end-to-end assessment frameworks.
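
For readers unfamiliar with the metric in finding 01, the sketch below shows how a phoneme error rate is conventionally computed: total Levenshtein edits divided by total reference phonemes over a corpus. This is the standard definition, not code from the paper; the function names and the romanized phoneme symbols are illustrative assumptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference phoneme missing
                curr[j - 1] + 1,     # insertion: extra phoneme in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(
    refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]
) -> float:
    """Corpus-level PER: total edits / total reference phonemes."""
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference set contains no phonemes")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_ref


# Hypothetical example: one substitution and one deletion over 8 phonemes.
refs = [["k", "i", "t", "a:", "b"], ["b", "a:", "b"]]
hyps = [["k", "i", "d", "a:", "b"], ["b", "a:"]]
per = phoneme_error_rate(refs, hyps)
```

Under this definition, the reported 8.92% would correspond to roughly nine edits per hundred reference phonemes.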

Abstract

Automated phoneme-level pronunciation assessment is vital for scalable speech therapy and language learning, yet validated tools for Arabic remain scarce. We present Harf-Speech, a modular system scoring Arabic pronunciation at the phoneme level on a clinical scale. It combines an MSA phonetizer, a fine-tuned speech-to-phoneme model, Levenshtein alignment, and a blended scorer using longest common subsequence and edit-distance metrics. We fine-tune three ASR architectures on Arabic phoneme data and benchmark them with zero-shot multimodal models; the best, OmniASR-CTC-1B-v2, achieves 8.92% phoneme error rate. Three certified speech-language pathologists independently scored 40 utterances for clinical validation. Harf-Speech attains a Pearson correlation of 0.791 and ICC(2,1) of 0.659 with mean expert scores, outperforming existing end-to-end assessment frameworks. These results show…
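
The abstract names a blended scorer built from longest-common-subsequence and edit-distance metrics but does not give its formula here. The following is a minimal sketch of one way such a blend could look, assuming both metrics are normalized by the reference length and mixed with a single weight. The `lcs_weight` parameter, its default of 0.5, the [0, 1] output range, and all function names are assumptions for illustration; the paper's actual weighting and its mapping onto the clinical scale are not reproduced.

```python
from typing import Sequence


def lcs_length(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Length of the longest common subsequence of two phoneme sequences."""
    prev = [0] * (len(hyp) + 1)
    for r in ref:
        curr = [0]
        for j, h in enumerate(hyp, start=1):
            if r == h:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def blended_score(
    ref: Sequence[str], hyp: Sequence[str], lcs_weight: float = 0.5
) -> float:
    """Weighted mix of LCS similarity and edit-distance similarity in [0, 1].

    lcs_weight is a hypothetical parameter, not a value from the paper.
    """
    if not ref:
        raise ValueError("reference phoneme sequence is empty")
    lcs_sim = lcs_length(ref, hyp) / len(ref)
    # Clamp at 0 because many insertions can push the edit count past len(ref).
    edit_sim = max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))
    return lcs_weight * lcs_sim + (1.0 - lcs_weight) * edit_sim


# Hypothetical example: target phonemes vs. a production with one substitution.
target = ["k", "i", "t", "a:", "b"]
produced = ["k", "i", "d", "a:", "b"]
score = blended_score(target, produced)
```

The two components react differently to errors: LCS similarity ignores extra inserted phonemes as long as the target sequence is still recoverable in order, while the edit-distance term penalizes them, so a blend can be tuned between lenient and strict behavior.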
