Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models

Mohamed Taher Alrefaie; Nour Eldin Morsy; Nada Samir

arXiv:2403.11130 · cs.CL · September 23, 2024 · 2 cites

TL;DR

This study evaluates how different tokenization strategies and vocabulary sizes affect Arabic language model performance across various NLP tasks, highlighting BPE with Farasa as the most effective approach.

Contribution

It provides a comprehensive analysis of tokenization and vocabulary impacts on Arabic NLP models, emphasizing morphological analysis and dialect challenges, which are less explored in prior work.

Findings

01 BPE with Farasa outperforms other tokenization strategies in multiple tasks.

02 Vocabulary size has limited impact on model performance.

03 Dialect-specific segmentation issues affect sentiment analysis accuracy.

Abstract

This paper presents a comprehensive examination of the impact of tokenization strategies and vocabulary sizes on the performance of Arabic language models in downstream natural language processing tasks. Our investigation focuses on the effectiveness of four tokenizers across various tasks, including News Classification, Hate Speech Detection, Sentiment Analysis, and Natural Language Inference. Leveraging a diverse set of vocabulary sizes, we scrutinize the interplay between tokenization approaches and model performance. The results reveal that Byte Pair Encoding (BPE) with Farasa outperforms other strategies in multiple tasks, underscoring the significance of morphological analysis in capturing the nuances of the Arabic language. However, challenges arise in sentiment analysis, where dialect-specific segmentation issues impact model efficiency. Computational efficiency…
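As a rough illustration of what "BPE" and "vocabulary size" mean in the abstract, the sketch below learns byte-pair merges from a tiny word list and tokenizes one word under different merge budgets. It is a toy under stated assumptions, not the authors' pipeline: the function names (`learn_bpe`, `merge_pair`, `tokenize`), the transliterated word list, and the merge budgets are all hypothetical, and the Farasa morphological pre-segmentation step the paper pairs with BPE is not modeled here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Greedily learn up to `num_merges` merges; more merges means a larger vocabulary."""
    # Each word starts as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def tokenize(word, merges):
    """Apply the learned merges to a word, in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: Latin transliterations of words sharing the root k-t-b.
words = ["kitab", "kitabi", "kutub", "maktab", "maktaba"]

for budget in (0, 3, 10):  # illustrative merge budgets, not the paper's vocabulary sizes
    merges = learn_bpe(words, budget)
    print(budget, tokenize("maktaba", merges))
```

With a budget of zero the word stays split into characters; larger budgets can only keep or reduce the number of tokens per word, which is the trade-off that a vocabulary-size sweep explores.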

Code & Models

Repositories

nourmorsy/PremioLLM

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Sparse Evolutionary Training · Byte Pair Encoding