TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect
Abir Messaoudi, Ahmed Cheikhrouhou, Hatem Haddad, Nourchene Ferchichi, Moez BenHajhmida, Abir Korched, Malek Naski, Faten Ghriss, Amine Kerkeni

TL;DR
This paper introduces TunBERT, a Transformer-based language model trained on noisy web data for the Tunisian dialect, achieving state-of-the-art results in sentiment analysis, dialect identification, and reading comprehension.
Contribution
The paper is the first to train a monolingual Transformer model for the Tunisian dialect using web-crawled data, demonstrating competitive performance with smaller datasets.
Findings
Noisy web data is more effective than structured data for dialect modeling.
Small web datasets can achieve performance comparable to larger datasets.
TunBERT outperforms previous models in all evaluated tasks.
Abstract
Pretrained contextualized text representation models learn an effective representation of a natural language to make it machine-understandable. After the breakthrough of the attention mechanism and the introduction of the Transformer, a new generation of pretrained models has been proposed, achieving good performance. Bidirectional Encoder Representations from Transformers (BERT) has become the state-of-the-art model for language understanding. Despite their success, most of the available models have been trained on Indo-European languages; similar research for under-represented languages and dialects remains sparse. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for under-represented languages, with a specific focus on the Tunisian dialect. We evaluate our language model on a sentiment analysis task, dialect…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Softmax · Residual Connection
