Pre-Training BERT on Arabic Tweets: Practical Considerations
Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish, Younes Samih

TL;DR
This paper explores pretraining BERT models on Arabic tweets, emphasizing data diversity and linguistic segmentation, resulting in models that achieve state-of-the-art performance on Arabic NLP tasks.
Contribution
It introduces multiple BERT models trained on diverse Arabic social media data, demonstrating the importance of data variety and linguistic segmentation for effective NLP.
Findings
Data diversity is crucial for model performance.
Linguistically aware segmentation improves results.
More data or training steps do not always lead to better models.
Abstract
Pretraining Bidirectional Encoder Representations from Transformers (BERT) for downstream NLP tasks is a non-trivial task. We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing. All are intended to support Arabic dialects and social media. The experiments highlight the centrality of data diversity and the efficacy of linguistically aware segmentation. They also highlight that more data or more training steps do not necessarily yield better models. Our new models achieve new state-of-the-art results on several downstream tasks. The resulting models are released to the community under the name QARiB.
Code & Models
- 🤗 ahmedabdelali/bert-base-qarib · model · 1.2k dl · ♡ 9
- 🤗 ahmedabdelali/bert-base-qarib60_1790k · model · 4 dl · ♡ 2
- 🤗 ahmedabdelali/bert-base-qarib60_1970k · model · 3 dl · ♡ 1
- 🤗 ahmedabdelali/bert-base-qarib60_860k · model · 53 dl
- 🤗 ahmedabdelali/bert-base-qarib_far · model · 3 dl
- 🤗 ahmedabdelali/bert-base-qarib_far_6500k · model · 2 dl
- 🤗 ahmedabdelali/bert-base-qarib_far_8280k · model · 11 dl
- 🤗 ahmedabdelali/bert-base-qarib_far_9920k · model · 3 dl
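The checkpoints listed above can presumably be loaded with the Hugging Face `transformers` library. The following is a minimal sketch, not the authors' own loading code: the helper name `load_qarib` and the constant `QARIB_CHECKPOINTS` are hypothetical, the model IDs are copied from the list, and it assumes each repository ships a standard BERT tokenizer and weights compatible with `AutoTokenizer` and `AutoModel`.

```python
# Model IDs copied from the "Code & Models" list; the constant name is illustrative.
QARIB_CHECKPOINTS = [
    "ahmedabdelali/bert-base-qarib",
    "ahmedabdelali/bert-base-qarib60_1790k",
    "ahmedabdelali/bert-base-qarib60_1970k",
    "ahmedabdelali/bert-base-qarib60_860k",
    "ahmedabdelali/bert-base-qarib_far",
    "ahmedabdelali/bert-base-qarib_far_6500k",
    "ahmedabdelali/bert-base-qarib_far_8280k",
    "ahmedabdelali/bert-base-qarib_far_9920k",
]


def load_qarib(model_id=QARIB_CHECKPOINTS[0]):
    """Return (tokenizer, model) for one of the listed checkpoints.

    Hypothetical helper: it assumes the repository follows the usual
    BERT layout on the Hugging Face Hub and that network access is available.
    """
    if model_id not in QARIB_CHECKPOINTS:
        raise ValueError(f"unknown QARiB checkpoint: {model_id}")

    # Imported lazily so the ID list can be inspected without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model
```

Calling `tokenizer, model = load_qarib()` downloads the first checkpoint in the list; passing another ID from `QARIB_CHECKPOINTS` selects a different variant. Which variant suits a given downstream task is not stated on this page, so it is worth comparing more than one.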
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Hate Speech and Cyberbullying Detection
Methods: Linear Layer · Attentive Walk-Aggregating Graph Neural Network · Linear Warmup With Linear Decay · Softmax · Adam · Multi-Head Attention · Residual Connection · Dropout · WordPiece · Attention Dropout
