Pre-Training BERT on Arabic Tweets: Practical Considerations

Ahmed Abdelali; Sabit Hassan; Hamdy Mubarak; Kareem Darwish; Younes Samih

arXiv:2102.10684·cs.CL·February 23, 2021·84 cites

Open Access 8 Models

TL;DR

This paper explores pretraining BERT models on Arabic tweets, emphasizing data diversity and linguistic segmentation, resulting in models that achieve state-of-the-art performance on Arabic NLP tasks.

Contribution

It introduces multiple BERT models trained on diverse Arabic social media data, demonstrating the importance of data variety and linguistic segmentation for effective NLP.

Findings

01 Data diversity is crucial for model performance.

02 Linguistically aware segmentation improves results.

03 More data or training steps do not always lead to better models.

Abstract

Pretraining Bidirectional Encoder Representations from Transformers (BERT) for downstream NLP tasks is non-trivial. We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing. All are intended to support Arabic dialects and social media. The experiments highlight the centrality of data diversity and the efficacy of linguistically aware segmentation. They also highlight that more data or more training steps do not necessarily yield better models. Our new models achieve new state-of-the-art results on several downstream tasks. The resulting models are released to the community under the name QARiB.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Hate Speech and Cyberbullying Detection

Methods: Linear Layer · Attentive Walk-Aggregating Graph Neural Network · Linear Warmup With Linear Decay · Softmax · Adam · Multi-Head Attention · Residual Connection · Dropout · WordPiece · Attention Dropout
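
The methods listed above include a linear warmup with linear decay learning-rate schedule, which is common in BERT pretraining. This page does not give the schedule's hyperparameters, so the following is only a minimal sketch of the general technique. The function name and the values for `warmup_steps`, `total_steps`, and `peak_lr` are illustrative assumptions, not settings reported for QARiB.

```python
def linear_warmup_linear_decay(step: int, warmup_steps: int,
                               total_steps: int, peak_lr: float) -> float:
    """Learning rate at `step` for a linear-warmup, linear-decay schedule.

    All arguments are hypothetical placeholders; the values used for
    the QARiB models are not stated on this page.
    """
    if warmup_steps < 0 or total_steps <= warmup_steps:
        raise ValueError("need 0 <= warmup_steps < total_steps")
    if step < warmup_steps:
        # Warmup phase: rise linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: fall linearly from peak_lr to 0 at total_steps,
    # and stay at 0 for any step beyond total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    # Assumed example values, chosen only for illustration.
    for s in (0, 50, 100, 550, 1000):
        print(s, linear_warmup_linear_decay(s, 100, 1000, 1e-4))
```

The warmup phase keeps early updates small while the Adam moment estimates are still noisy. The decay phase then reduces the step size as training approaches its planned end.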