Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition
Ayhan Kucukmanisa, Derya Gelmez, Sukru Selim Calik, Zeynep Hilal Kilimci

TL;DR
This study introduces a transformer-based multimodal framework combining acoustic and textual data for Arabic phoneme mispronunciation detection, significantly improving accuracy in Quranic recitation analysis.
Contribution
The paper presents a novel multimodal deep learning approach that integrates UniSpeech and BERT embeddings through several fusion strategies to improve phoneme mispronunciation detection.
Findings
UniSpeech-BERT configuration yields high accuracy
Fusion strategies improve detection robustness
Model generalizes well across diverse datasets
Abstract
Recent advances in multimodal deep learning have greatly enhanced the capability of systems for speech analysis and pronunciation assessment. Accurate pronunciation detection remains a key challenge in Arabic, particularly in the context of Quranic recitation, where subtle phonetic differences can alter meaning. Addressing this challenge, the present study proposes a transformer-based multimodal framework for Arabic phoneme mispronunciation detection that combines acoustic and textual representations to achieve higher precision and robustness. The framework integrates UniSpeech-derived acoustic embeddings with BERT-based textual embeddings extracted from Whisper transcriptions, creating a unified representation that captures both phonetic detail and linguistic context. To determine the most effective integration strategy, early, intermediate, and late fusion methods were implemented and…
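The three fusion strategies named in the abstract differ mainly in where the two modalities meet. The following is a minimal sketch of that distinction and not the authors' implementation: the function names, the toy vectors, the ReLU projection, and the weighted-average rule for late fusion are all assumptions made for illustration. The short lists stand in for UniSpeech and BERT embeddings, which would be far higher-dimensional in practice.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def project(x: Sequence[float], weights: Matrix) -> Vector:
    """Hypothetical per-modality layer: matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def early_fusion(acoustic: Sequence[float], textual: Sequence[float]) -> Vector:
    """Early fusion: concatenate the raw embeddings before any modelling.

    A single downstream classifier would consume the joint vector.
    """
    return list(acoustic) + list(textual)


def intermediate_fusion(
    acoustic: Sequence[float],
    textual: Sequence[float],
    acoustic_weights: Matrix,
    textual_weights: Matrix,
) -> Vector:
    """Intermediate fusion: transform each modality first, then concatenate
    the hidden features."""
    return project(acoustic, acoustic_weights) + project(textual, textual_weights)


def late_fusion(p_acoustic: float, p_textual: float, weight: float = 0.5) -> float:
    """Late fusion: combine the outputs of two independent classifiers.

    A weighted average of probabilities is assumed here; other rules
    (voting, stacking) are equally plausible.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_acoustic + (1.0 - weight) * p_textual


if __name__ == "__main__":
    # Toy stand-ins for an acoustic and a textual embedding (assumed values).
    acoustic_embedding = [0.2, -0.4, 0.9]
    textual_embedding = [0.5, 0.1]

    print("early:", early_fusion(acoustic_embedding, textual_embedding))
    print(
        "intermediate:",
        intermediate_fusion(
            acoustic_embedding,
            textual_embedding,
            acoustic_weights=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
            textual_weights=[[1.0, 1.0]],
        ),
    )
    print("late:", late_fusion(p_acoustic=0.8, p_textual=0.4))
```

In this sketch, early fusion leaves all cross-modal interaction to one classifier, intermediate fusion lets each modality be reshaped before the features are merged, and late fusion keeps the two models fully separate until their scores are combined.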
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Emotion and Mood Recognition
