Supporting Undotted Arabic with Pre-trained Language Models
Aviad Rom, Kfir Bar

TL;DR
This paper investigates how pre-trained Arabic language models perform on undotted Arabic texts, in which users have intentionally removed the consonantal dots from letters to bypass content filters, and proposes methods to support such texts without additional training.
Contribution
The study introduces techniques to adapt pre-trained Arabic models for undotted texts without additional training, enhancing their robustness against obfuscation tactics.
Findings
Near-perfect performance on one of the two downstream tasks evaluated
Effective methods for supporting undotted texts without retraining
Insights into the impact of consonantal-dot removal on model performance
Abstract
We observe a recent behaviour on social media, in which users intentionally remove consonantal dots from Arabic letters, in order to bypass content-classification algorithms. Content classification is typically done by fine-tuning pre-trained language models, which have been recently employed by many natural-language-processing applications. In this work we study the effect of applying pre-trained Arabic language models on "undotted" Arabic texts. We suggest several ways of supporting undotted texts with pre-trained models, without additional training, and measure their performance on two Arabic natural-language-processing downstream tasks. The results are encouraging; in one of the tasks our method shows nearly perfect performance.
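To make the notion of "undotted" text concrete, the following minimal Python sketch strips consonantal dots by mapping each dotted letter to a dotless Unicode counterpart. The mapping table and the `undot` function name are illustrative assumptions, not the authors' implementation; the paper's exact character mapping (for example, any position-dependent handling of letters) may differ.

```python
# Illustrative sketch: map dotted Arabic letters to dotless look-alikes.
# ASSUMPTION: this table is a plausible mapping, not the one used in the paper.
UNDOT_MAP = {
    "\u0628": "\u066E",  # beh   -> dotless beh
    "\u062A": "\u066E",  # teh   -> dotless beh
    "\u062B": "\u066E",  # theh  -> dotless beh
    "\u062C": "\u062D",  # jeem  -> hah
    "\u062E": "\u062D",  # khah  -> hah
    "\u0630": "\u062F",  # thal  -> dal
    "\u0632": "\u0631",  # zain  -> reh
    "\u0634": "\u0633",  # sheen -> seen
    "\u0636": "\u0635",  # dad   -> sad
    "\u0638": "\u0637",  # zah   -> tah
    "\u063A": "\u0639",  # ghain -> ain
    "\u0641": "\u06A1",  # feh   -> dotless feh
    "\u0642": "\u066F",  # qaf   -> dotless qaf
    "\u0646": "\u06BA",  # noon  -> noon ghunna (dotless noon)
    "\u064A": "\u0649",  # yeh   -> alef maksura (dotless yeh)
    "\u0629": "\u0647",  # teh marbuta -> heh
}

# str.maketrans accepts a dict of single-character keys.
UNDOT_TABLE = str.maketrans(UNDOT_MAP)


def undot(text: str) -> str:
    """Return text with consonantal dots removed; other characters pass through."""
    return text.translate(UNDOT_TABLE)


if __name__ == "__main__":
    word = "\u0628\u064A\u062A"  # the word "bayt" (house): beh, yeh, teh
    print(undot(word))
```

Because several dotted letters collapse onto the same dotless shape, the mapping is many-to-one and cannot be inverted by a simple lookup, which is one reason undotted input is hard for models trained only on standard dotted text.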
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
