CATT: Character-based Arabic Tashkeel Transformer
Faris Alasmary, Orjuwan Zaafarani, Ahmad Ghannam

TL;DR
This paper presents a novel character-based transformer model for Arabic Text Diacritization that outperforms existing models and sets new state-of-the-art results on benchmark datasets.
Contribution
Introduces a new transformer-based approach for Arabic diacritization using pretrained character models and the Noisy-Student training method.
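As a rough illustration of the Noisy-Student idea named above, the loop below trains a teacher on labeled data, lets it pseudo-label unlabeled inputs, and then trains a student on the noised union of both sets. This is a generic sketch, not the authors' training code: `noisy_student`, `train_fn`, `add_noise`, and `rounds` are hypothetical names, and the summary does not say which noise or how many rounds the paper used.

```python
def noisy_student(train_fn, labeled, unlabeled, add_noise, rounds=1):
    """Generic Noisy-Student loop (illustrative; all names are assumptions).

    train_fn:  takes a list of (input, label) pairs, returns a callable model.
    labeled:   list of (input, label) pairs with gold labels.
    unlabeled: list of inputs without labels.
    add_noise: perturbs an input before the student sees it.
    """
    # The teacher is trained on clean labeled data only.
    model = train_fn(list(labeled))
    for _ in range(rounds):
        # The teacher labels the unlabeled inputs without any noise.
        pseudo_labeled = [(x, model(x)) for x in unlabeled]
        # The student trains on gold + pseudo labels, with noised inputs.
        noisy_data = [(add_noise(x), y) for x, y in list(labeled) + pseudo_labeled]
        # The trained student becomes the teacher for the next round.
        model = train_fn(noisy_data)
    return model
```

In a diacritization setting, `train_fn` would fine-tune the transformer and `add_noise` might perturb the input text or rely on dropout; both choices are assumptions here.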
Findings
Achieves relative DER reductions of 30.83% and 35.21% on the WikiNews and CATT datasets, respectively.
Outperforms GPT-4-turbo in diacritization accuracy on the CATT dataset.
Open-sources models and datasets for further research.
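The relative DER reduction quoted in the findings can be sketched as follows. The functions assume DER is the percentage of characters whose predicted diacritic label differs from the reference; the paper's exact counting rules (for example, whether undiacritized characters or case endings are included) are not given in this summary, so treat the definition as an assumption. The function names and the numbers in the usage lines are hypothetical, not results from the paper.

```python
def diacritic_error_rate(reference_labels, predicted_labels):
    """Percentage of positions where the predicted label differs from the reference."""
    if len(reference_labels) != len(predicted_labels):
        raise ValueError("reference and prediction must have the same length")
    if not reference_labels:
        raise ValueError("cannot compute DER on an empty sequence")
    errors = sum(r != p for r, p in zip(reference_labels, predicted_labels))
    return 100.0 * errors / len(reference_labels)


def relative_reduction(baseline_der, new_der):
    """Relative DER reduction (in percent) of new_der with respect to baseline_der."""
    if baseline_der <= 0:
        raise ValueError("baseline DER must be positive")
    return 100.0 * (baseline_der - new_der) / baseline_der


# Hypothetical numbers: a baseline at 4.0% DER and a new model at 3.0% DER.
print(relative_reduction(4.0, 3.0))  # 25.0
```

A relative reduction of 25% here means the new model removes a quarter of the baseline's errors, not that its DER is 25 points lower.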
Abstract
Tashkeel, or Arabic Text Diacritization (ATD), greatly enhances the comprehension of Arabic text by removing ambiguity and minimizing the risk of misinterpretations caused by its absence. It plays a crucial role in improving Arabic text processing, particularly in applications such as text-to-speech and machine translation. This paper introduces a new approach to training ATD models. First, we finetuned two transformers, encoder-only and encoder-decoder, that were initialized from a pretrained character-based BERT. Then, we applied the Noisy-Student approach to boost the performance of the best model. We evaluated our models alongside 11 commercial and open-source models using two manually labeled benchmark datasets: WikiNews and our CATT dataset. Our findings show that our top model surpasses all evaluated models by relative Diacritic Error Rates (DERs) of 30.83% and 35.21% on…
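To make the character-based framing concrete, the sketch below splits diacritized Arabic text into base characters and per-character diacritic labels, which is one common way to pose ATD as sequence labeling. It is an illustration under assumptions, not the paper's preprocessing: the diacritic set is limited to the eight core marks U+064B to U+0652, and `split_diacritics` and `strip_diacritics` are hypothetical helper names.

```python
# Assumption: only the eight core tashkeel marks, U+064B (fathatan) to U+0652 (sukun).
ARABIC_DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def split_diacritics(text):
    """Return (base_chars, labels); labels[i] holds the marks that follow base_chars[i]."""
    base_chars, labels = [], []
    for ch in text:
        if ch in ARABIC_DIACRITICS:
            # Attach the mark to the preceding base character;
            # a mark with no preceding base character is dropped.
            if labels:
                labels[-1] += ch
        else:
            base_chars.append(ch)
            labels.append("")
    return base_chars, labels


def strip_diacritics(text):
    """Remove the core diacritics, yielding the undiacritized model input."""
    return "".join(split_diacritics(text)[0])


# "kataba" written with a fatha after each of its three letters.
word = "\u0643\u064E\u062A\u064E\u0628\u064E"
chars, labels = split_diacritics(word)
print(len(chars), labels == ["\u064E"] * 3)  # 3 True
```

A model trained this way reads the output of `strip_diacritics` and predicts one label per character; a shadda followed by a vowel is kept together as a single two-mark label.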
Taxonomy
Topics: Handwritten Text Recognition Techniques · Language, Linguistics, Cultural Analysis · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Weight Decay · Residual Connection · Multi-Head Attention · WordPiece · Softmax · Layer Normalization · Attention Dropout
