Tibyan Corpus: Balanced and Comprehensive Error Coverage Corpus Using ChatGPT for Arabic Grammatical Error Correction
Ahlam Alrehili, Areej Alhothali

TL;DR
This paper introduces the Tibyan corpus, a large, balanced Arabic grammatical error correction dataset created using ChatGPT for data augmentation, addressing resource limitations in Arabic NLP.
Contribution
The study develops a novel Arabic GEC corpus using ChatGPT for data augmentation, combined with expert validation, to improve resource availability for Arabic grammatical error correction.
Findings
The Tibyan corpus contains approximately 600K tokens.
It covers 49 error types spanning orthography, morphology, syntax, semantics, punctuation, merge, and split errors.
Linguistic experts validated the corpus to ensure accuracy.
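As a rough illustration of what "balanced" coverage could mean in practice, the sketch below stores parallel examples tagged with one of the seven categories listed above and measures how evenly the categories are represented. The `ParallelExample` record, its field names, and the `imbalance_ratio` metric are assumptions made for this example; they are not the corpus's actual schema or the authors' evaluation method.

```python
from collections import Counter
from dataclasses import dataclass

# The seven top-level categories named in the findings.
CATEGORIES = (
    "orthography", "morphology", "syntax", "semantics",
    "punctuation", "merge", "split",
)


@dataclass(frozen=True)
class ParallelExample:
    """Hypothetical record: an erroneous sentence paired with its correction."""
    source: str    # sentence containing a grammatical error
    target: str    # corrected sentence
    category: str  # one of CATEGORIES


def category_shares(examples):
    """Return each category's share of the examples (0.0 if absent)."""
    counts = Counter(ex.category for ex in examples)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


def imbalance_ratio(examples):
    """Largest category count divided by the smallest; inf if any is empty."""
    counts = Counter(ex.category for ex in examples)
    sizes = [counts[c] for c in CATEGORIES]
    return float("inf") if min(sizes) == 0 else max(sizes) / min(sizes)


# Toy data with placeholder strings standing in for Arabic sentences.
toy = [ParallelExample(f"src-{i}", f"tgt-{i}", c) for i, c in enumerate(CATEGORIES)]
shares = category_shares(toy)
ratio = imbalance_ratio(toy)
```

A ratio close to 1.0 indicates that no category dominates; a corpus missing an entire category yields infinity under this metric.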
Abstract
Natural language processing (NLP) uses text data augmentation to overcome sample size constraints; increasing the sample size is a natural and widely used strategy for alleviating them. This study applies augmentation to Arabic grammatical error correction (GEC), because Arabic is considered a language with limited GEC resources. Most Arabic GEC research relies only on the QALB-14 and QALB-15 datasets, which provide approximately 20,500 parallel examples, a low number compared with other languages. Therefore, this study aims to develop an Arabic corpus called "Tibyan" for grammatical error correction using ChatGPT. ChatGPT is used as a data augmentation tool based on pairs of Arabic sentences, each matching a sentence containing grammatical errors with an error-free sentence extracted from Arabic…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
