Data Augmentation using Transformers and Similarity Measures for Improving Arabic Text Classification
Dania Refai, Saleh Abo-Soud, Mohammad Abdel-Rahman

TL;DR
This paper introduces a novel Arabic data augmentation method that uses AraGPT-2 and similarity measures, raising F1 scores for sentiment classification by 4-13% across multiple datasets.
Contribution
It proposes a new Arabic data augmentation technique leveraging AraGPT-2 and similarity metrics, addressing the lack of advanced augmentation methods for Arabic NLP.
Findings
Enhanced F1 scores: 4-13% increase across datasets
Effective augmentation improves Arabic sentiment classification
Similarity measures validate generated data quality
Abstract
The performance of learning models relies heavily on the availability and adequacy of training data. To address dataset adequacy, researchers have extensively explored data augmentation (DA) as a promising approach. DA generates new data instances by applying transformations to the available data, thereby increasing dataset size and variability. This approach has improved model performance and accuracy, particularly for class imbalance problems in classification tasks. However, few studies have explored DA for the Arabic language, and those that have rely on traditional approaches such as paraphrasing or noising-based techniques. In this paper, we propose a new Arabic DA method that uses a recent, powerful language model, AraGPT-2, for the augmentation process. The generated sentences are evaluated in terms of context, semantics, diversity, and novelty using…
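The idea of screening generated sentences with similarity measures can be illustrated with a minimal sketch. The abstract is cut off before it names the measures, so the choice of cosine and Jaccard similarity over whitespace tokens is an assumption made for illustration, as are the function names and the `low`/`high` thresholds. This is not the authors' implementation, and the generation step with AraGPT-2 is left out entirely.

```python
import math
from collections import Counter


def tokens(text):
    # Whitespace tokenization (assumption); a real pipeline would
    # likely use an Arabic-aware tokenizer and normalization.
    return text.split()


def cosine_similarity(a, b):
    # Cosine similarity between bag-of-words count vectors.
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    # Share of distinct tokens the two sentences have in common.
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def filter_candidates(source, candidates, low=0.2, high=0.9):
    # Keep candidates that are related to the source (score >= low)
    # but are not near-copies of it (score <= high).
    # Both thresholds are hypothetical values, not taken from the paper.
    kept = []
    for cand in candidates:
        score = cosine_similarity(source, cand)
        if low <= score <= high:
            kept.append(cand)
    return kept
```

In this sketch the lower bound stands in for the context and semantics checks, while the upper bound stands in for diversity and novelty: a generated sentence that merely repeats its source adds little to the training set.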
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Text and Document Classification Technologies · Imbalanced Data Classification Techniques
