The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages

Abdulhady Abas Abdullah; Amir H. Gandomi; Tarik A Rashid; Seyedali Mirjalili; Laith Abualigah; Milena Živković; Hadi Veisi

arXiv:2507.18762·cs.CL·July 28, 2025

Open Access

TL;DR

This paper introduces language-specific RoBERTa models for Arabic-script languages, demonstrating that script-focused pre-training improves text classification performance over general multilingual models.

Contribution

The paper presents the AS-RoBERTa family: four RoBERTa-based models, each pre-trained on a language-specific corpus, to better capture orthographic features and improve classification results in Arabic-script languages.

Findings

01

AS-RoBERTa outperforms mBERT and XLM-RoBERTa by 2 to 5 percentage points on classification tasks.

02

An ablation study indicates that script-focused pre-training, which captures language-specific orthographic patterns, is central to these gains.

03

Error analysis with confusion matrices shows how shared script traits and domain content affect model performance.

Abstract

In natural language processing, multilingual models like mBERT and XLM-RoBERTa promise broad coverage but often struggle with languages that share a script yet differ in orthographic norms and cultural context. This issue is especially notable in Arabic-script languages such as Kurdish Sorani, Arabic, Persian, and Urdu. We introduce the Arabic Script RoBERTa (AS-RoBERTa) family: four RoBERTa-based models, each pre-trained on a large corpus tailored to its specific language. By focusing pre-training on language-specific script features and statistics, our models capture patterns overlooked by general-purpose models. When fine-tuned on classification tasks, AS-RoBERTa variants outperform mBERT and XLM-RoBERTa by 2 to 5 percentage points. An ablation study confirms that script-focused pre-training is central to these gains. Error analysis using confusion matrices shows how shared script…
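The abstract reports gains in percentage points and an error analysis based on confusion matrices. As a minimal sketch of both quantities, assuming a toy three-class task, the Python below builds a confusion matrix and compares two sets of predictions. The class labels, predictions, and function names are hypothetical illustrations, not data or code from the paper.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data (assumption): not taken from the paper.
labels = ["news", "sports", "religion"]
y_true = ["news"] * 4 + ["sports"] * 3 + ["religion"] * 3
baseline_pred = ["news", "news", "sports", "religion",
                 "sports", "sports", "news",
                 "religion", "religion", "news"]
candidate_pred = ["news", "news", "news", "religion",
                  "sports", "sports", "sports",
                  "religion", "religion", "news"]

acc_baseline = accuracy(y_true, baseline_pred)
acc_candidate = accuracy(y_true, candidate_pred)

# Absolute gap in percentage points, as opposed to a relative change.
gap_pp = (acc_candidate - acc_baseline) * 100
relative_pct = (acc_candidate - acc_baseline) / acc_baseline * 100

print(f"gap: {gap_pp:.1f} percentage points")
print(f"relative improvement: {relative_pct:.1f}%")

# Off-diagonal cells show which classes are confused with each other.
for label, row in zip(labels, confusion_matrix(y_true, candidate_pred, labels)):
    print(f"{label:>8}: {row}")
```

The distinction between the two printed numbers matters when reading reported gains: a gap in percentage points is a difference between two accuracies, while a relative improvement divides that difference by the baseline accuracy.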

Taxonomy

Topics: Text and Document Classification Technologies · Topic Modeling · Authorship Attribution and Profiling