Supporting Undotted Arabic with Pre-trained Language Models

Aviad Rom; Kfir Bar

arXiv:2111.09791·cs.CL·November 19, 2021

Open Access

TL;DR

This paper investigates how pre-trained Arabic language models perform on undotted Arabic texts, in which users have intentionally removed the consonantal dots from letters to bypass content filters, and proposes methods to support such texts without retraining.

Contribution

The study introduces techniques to adapt pre-trained Arabic models for undotted texts without additional training, enhancing their robustness against obfuscation tactics.

Findings

01

Near-perfect performance on one of the two downstream tasks evaluated

02

Effective methods for supporting undotted texts without retraining

03

Insights into how removing consonantal dots affects model performance

Abstract

We observe a recent behaviour on social media, in which users intentionally remove consonantal dots from Arabic letters, in order to bypass content-classification algorithms. Content classification is typically done by fine-tuning pre-trained language models, which have been recently employed by many natural-language-processing applications. In this work we study the effect of applying pre-trained Arabic language models on "undotted" Arabic texts. We suggest several ways of supporting undotted texts with pre-trained models, without additional training, and measure their performance on two Arabic natural-language-processing downstream tasks. The results are encouraging; in one of the tasks our method shows nearly perfect performance.
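
To make the notion of "undotted" text concrete, the following minimal Python sketch maps dotted Arabic letters to dotless counterparts. The mapping table `UNDOT_MAP` and the function `undot` are hypothetical names introduced here for illustration; the specific letter pairs are an assumption based on common dotless letter forms in Unicode, not the authors' published mapping or method.

```python
# Illustrative sketch only: this table is an assumption, not the paper's mapping.
# Each dotted letter is mapped to a dotless letter with the same base shape.
UNDOT_MAP = {
    "\u0628": "\u066E",  # beh         -> dotless beh
    "\u062A": "\u066E",  # teh         -> dotless beh
    "\u062B": "\u066E",  # theh        -> dotless beh
    "\u062C": "\u062D",  # jeem        -> hah
    "\u062E": "\u062D",  # khah        -> hah
    "\u0630": "\u062F",  # thal        -> dal
    "\u0632": "\u0631",  # zain        -> reh
    "\u0634": "\u0633",  # sheen       -> seen
    "\u0636": "\u0635",  # dad         -> sad
    "\u0638": "\u0637",  # zah         -> tah
    "\u063A": "\u0639",  # ghain       -> ain
    "\u0641": "\u06A1",  # feh         -> dotless feh
    "\u0642": "\u066F",  # qaf         -> dotless qaf
    "\u0646": "\u06BA",  # noon        -> noon ghunna (dotless noon)
    "\u0629": "\u0647",  # teh marbuta -> heh
    "\u064A": "\u0649",  # yeh         -> alef maksura (dotless yeh)
}

# Precompute a translation table for fast per-character replacement.
_UNDOT_TABLE = str.maketrans(UNDOT_MAP)


def undot(text: str) -> str:
    """Return `text` with dotted Arabic letters replaced by dotless forms.

    Characters not in UNDOT_MAP (Latin letters, digits, already dotless
    Arabic letters) pass through unchanged.
    """
    return text.translate(_UNDOT_TABLE)


if __name__ == "__main__":
    word = "\u0628\u064A\u062A"  # beh, yeh, teh
    print(undot(word))
```

Because several dotted letters collapse onto one dotless shape under this assumed table (beh, teh, and theh all become the same character), the transformation cannot be inverted by a simple reverse lookup, which is one reason undotted input is hard for a model trained on dotted text.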

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies