Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition

Ayhan Kucukmanisa; Derya Gelmez; Sukru Selim Calik; Zeynep Hilal Kilimci

arXiv:2511.17477·cs.SD·November 24, 2025

TL;DR

This study introduces a transformer-based multimodal framework combining acoustic and textual data for Arabic phoneme mispronunciation detection, significantly improving accuracy in Quranic recitation analysis.

Contribution

It presents a novel multimodal deep learning approach integrating UniSpeech and BERT embeddings with various fusion strategies for enhanced phoneme mispronunciation detection.

Findings

01. UniSpeech-BERT configuration yields high accuracy
02. Fusion strategies improve detection robustness
03. Model generalizes well across diverse datasets

Abstract

Recent advances in multimodal deep learning have greatly enhanced the capability of systems for speech analysis and pronunciation assessment. Accurate pronunciation detection remains a key challenge in Arabic, particularly in the context of Quranic recitation, where subtle phonetic differences can alter meaning. Addressing this challenge, the present study proposes a transformer-based multimodal framework for Arabic phoneme mispronunciation detection that combines acoustic and textual representations to achieve higher precision and robustness. The framework integrates UniSpeech-derived acoustic embeddings with BERT-based textual embeddings extracted from Whisper transcriptions, creating a unified representation that captures both phonetic detail and linguistic context. To determine the most effective integration strategy, early, intermediate, and late fusion methods were implemented and…
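The difference between the early, intermediate, and late fusion strategies named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the short hand-written vectors stand in for UniSpeech and BERT embeddings, all weights are untrained placeholder values, and every function name (`early_fusion`, `intermediate_fusion`, `late_fusion`) is a hypothetical chosen for illustration. Each function returns a score in (0, 1) that could be read as a mispronunciation probability.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Squash a real-valued score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float],
           x: Sequence[float]) -> Vector:
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x: Sequence[float]) -> Vector:
    return [max(0.0, v) for v in x]


def early_fusion(acoustic: Vector, text: Vector,
                 w: Vector, b: float) -> float:
    """Concatenate the raw embeddings, then apply one shared classifier."""
    fused = acoustic + text
    return sigmoid(linear([w], [b], fused)[0])


def intermediate_fusion(acoustic: Vector, text: Vector,
                        w_a, b_a, w_t, b_t,
                        w_out: Vector, b_out: float) -> float:
    """Project each modality separately, concatenate the hidden
    representations, then classify."""
    h_a = relu(linear(w_a, b_a, acoustic))
    h_t = relu(linear(w_t, b_t, text))
    return sigmoid(linear([w_out], [b_out], h_a + h_t)[0])


def late_fusion(acoustic: Vector, text: Vector,
                w_a: Vector, b_a: float, w_t: Vector, b_t: float,
                alpha: float = 0.5) -> float:
    """Classify each modality on its own, then mix the two probabilities.
    alpha is an assumed mixing weight for the acoustic branch."""
    p_a = sigmoid(linear([w_a], [b_a], acoustic)[0])
    p_t = sigmoid(linear([w_t], [b_t], text)[0])
    return alpha * p_a + (1.0 - alpha) * p_t


# Toy stand-ins for one utterance (real embeddings have hundreds of dimensions).
acoustic_emb = [0.2, -0.1, 0.4]   # placeholder for a UniSpeech embedding
text_emb = [0.5, 0.3]             # placeholder for a BERT embedding

p_early = early_fusion(acoustic_emb, text_emb,
                       w=[0.3, -0.2, 0.5, 0.1, -0.4], b=0.0)
p_mid = intermediate_fusion(
    acoustic_emb, text_emb,
    w_a=[[0.5, 0.1, -0.3], [0.2, 0.4, 0.6]], b_a=[0.0, 0.0],
    w_t=[[0.7, -0.2], [0.1, 0.9]], b_t=[0.0, 0.0],
    w_out=[0.4, -0.6, 0.3, 0.2], b_out=0.0)
p_late = late_fusion(acoustic_emb, text_emb,
                     w_a=[0.3, -0.2, 0.5], b_a=0.0,
                     w_t=[0.1, -0.4], b_t=0.0)
```

The three variants differ only in where the modalities meet: before any modality-specific processing (early), after a per-modality projection (intermediate), or at the level of final predictions (late). In a trained system the weights would be learned rather than fixed, and the late-fusion mixing weight could itself be tuned or learned.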

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Emotion and Mood Recognition