TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect

Abir Messaoudi; Ahmed Cheikhrouhou; Hatem Haddad; Nourchene Ferchichi; Moez BenHajhmida; Abir Korched; Malek Naski; Faten Ghriss; Amine Kerkeni

arXiv:2111.13138·cs.CL·November 29, 2021

TL;DR

This paper introduces TunBERT, a Transformer-based language model trained on noisy web data for the Tunisian dialect, achieving state-of-the-art results in sentiment analysis, dialect identification, and reading comprehension.

Contribution

The paper is the first to train a monolingual Transformer model for the Tunisian dialect using web-crawled data, and it shows that a smaller dataset can yield competitive performance.

Findings

01

Noisy web data is more effective than structured data for dialect modeling.

02

Small web datasets can achieve performance comparable to larger datasets.

03

TunBERT outperforms previous models in all evaluated tasks.

Abstract

Pretrained contextualized text representation models learn an effective representation of a natural language to make it machine-understandable. After the breakthrough of the attention mechanism, a new generation of pretrained models has been proposed, achieving good performance since the introduction of the Transformer. Bidirectional Encoder Representations from Transformers (BERT) has become the state-of-the-art model for language understanding. Despite their success, most of the available models have been trained on Indo-European languages; similar research for under-represented languages and dialects remains sparse. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for under-represented languages, with a specific focus on the Tunisian dialect. We evaluate our language model on a sentiment analysis task, dialect…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Softmax · Residual Connection