CATT: Character-based Arabic Tashkeel Transformer

Faris Alasmary; Orjuwan Zaafarani; Ahmad Ghannam

arXiv:2407.03236·cs.CL·July 16, 2024

TL;DR

This paper presents a novel character-based transformer model for Arabic Text Diacritization that outperforms existing models and sets new state-of-the-art results on benchmark datasets.

Contribution

Introduces a new transformer-based approach for Arabic diacritization using pretrained character models and the Noisy-Student training method.

Findings

01

Achieves relative DER reductions of 30.83% and 35.21% on the WikiNews and CATT datasets, respectively.

02

Outperforms GPT-4-turbo in diacritization accuracy on the CATT dataset.

03

Open-sources models and datasets for further research.

Abstract

Tashkeel, or Arabic Text Diacritization (ATD), greatly enhances the comprehension of Arabic text by removing ambiguity and minimizing the risk of misinterpretations caused by its absence. It plays a crucial role in improving Arabic text processing, particularly in applications such as text-to-speech and machine translation. This paper introduces a new approach to training ATD models. First, we fine-tuned two transformers, encoder-only and encoder-decoder, that were initialized from a pretrained character-based BERT. Then, we applied the Noisy-Student approach to boost the performance of the best model. We evaluated our models alongside 11 commercial and open-source models using two manually labeled benchmark datasets: WikiNews and our CATT dataset. Our findings show that our top model surpasses all evaluated models by relative Diacritic Error Rates (DERs) of 30.83% and 35.21% on…
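To make the headline metric concrete, the following is a minimal sketch of how a Diacritic Error Rate and a relative DER reduction could be computed. It is not the authors' evaluation script: the function names, the set of marks treated as diacritics (U+064B to U+0652), and the choice to score only characters in the basic Arabic letter range are all assumptions made for illustration, and the paper's own protocol may differ (for example in how case endings or undiacritized characters are handled).

```python
# Assumption: the eight basic tashkeel marks, fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Split a diacritized string into (base_char, diacritics) pairs."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS and pairs:
            # Attach the mark to the most recent base character.
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        elif ch not in DIACRITICS:
            pairs.append((ch, ""))
        # A diacritic with no preceding base character is ignored.
    return pairs


def diacritic_error_rate(reference, hypothesis):
    """Percentage of scored characters whose diacritics differ from the reference."""
    ref = split_diacritics(reference)
    hyp = split_diacritics(hypothesis)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("reference and hypothesis must share the same base characters")
    # Assumption: score only characters in the basic Arabic letter range.
    scored = [
        (r, h)
        for (base, r), (_, h) in zip(ref, hyp)
        if "\u0621" <= base <= "\u064A"
    ]
    if not scored:
        return 0.0
    # Sort the marks so that shadda+vowel and vowel+shadda compare equal.
    errors = sum(sorted(r) != sorted(h) for r, h in scored)
    return 100.0 * errors / len(scored)


def relative_reduction(baseline_der, new_der):
    """Relative DER reduction of a new model over a baseline, in percent."""
    return 100.0 * (baseline_der - new_der) / baseline_der


# Toy example: reference "kataba" vs. a hypothesis diacritized as "kutiba".
reference = "\u0643\u064E\u062A\u064E\u0628\u064E"   # kaf+fatha, ta+fatha, ba+fatha
hypothesis = "\u0643\u064F\u062A\u0650\u0628\u064E"  # kaf+damma, ta+kasra, ba+fatha
der = diacritic_error_rate(reference, hypothesis)    # 2 of 3 letters differ
```

Under this definition, a relative reduction compares two absolute DER values: a model that lowers a baseline DER of 4.0 to 3.0 achieves `relative_reduction(4.0, 3.0)`, a 25% relative reduction, even though the absolute gain is one percentage point.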

Code & Models

Repositories

abjadai/catt
PyTorch · Official

Datasets

Bisher/CATT_benchmark
dataset · 5 dl

Videos

CATT: Character-based Arabic Tashkeel Transformer

Taxonomy

Topics: Handwritten Text Recognition Techniques · Language, Linguistics, Cultural Analysis · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Weight Decay · Residual Connection · Multi-Head Attention · WordPiece · Softmax · Layer Normalization · Attention Dropout