AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3
Mark Kashirskiy, Artiom Lipinski, Ilya Makarov

TL;DR
This paper introduces AraToken, an Arabic-specific tokenizer with a normalization pipeline, and LEP, a method for integrating it into Qwen3. The normalized tokenizer achieves 18% lower fertility, and LEP reduces evaluation loss from 8.28 to 2.43.
Contribution
The paper contributes AraToken, an Arabic tokenizer with a normalization pipeline, and LEP, a method for extending Qwen3 with that tokenizer.
Findings
18% lower fertility (1.199 vs 1.35 tokens/word) with the normalized SentencePiece tokenizer
LEP reduces evaluation loss from 8.28 to 2.43
Improved tokenization efficiency for Arabic
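Fertility, the metric behind the first finding, is the average number of tokens produced per word. The sketch below shows one common way to compute it, assuming whitespace-delimited words. The `fertility` helper and the `toy_tokenize` function are hypothetical illustrations, not the paper's evaluation code or the AraToken tokenizer.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word.

    Assumption: a "word" is any whitespace-separated chunk. The paper may
    count words differently.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("cannot compute fertility on an empty corpus")
    return total_tokens / total_words


def toy_tokenize(text: str, max_len: int = 3) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into pieces of at most max_len characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + max_len] for i in range(0, len(word), max_len))
    return pieces


if __name__ == "__main__":
    corpus = ["ab cd", "efgh"]
    print(f"{fertility(corpus, toy_tokenize):.3f} tokens/word")
```

A lower value means the tokenizer splits words into fewer pieces, so the same text occupies fewer positions in the model's context window.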
Abstract
Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages exhibit suboptimal performance on morphologically rich languages such as Arabic, resulting in inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on the SentencePiece Unigram algorithm with a comprehensive normalization pipeline addressing Arabic-specific orthographic variations including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility (1.199 vs 1.35 tokens/word) compared to unnormalized…
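The abstract names three kinds of orthographic variation that the normalization pipeline addresses: Alif variants, diacritics, and Arabic-Indic numerals. The following is a minimal sketch of what such a step might look like, not the authors' pipeline. The specific choices here (folding Alif variants to bare Alif, deleting diacritics, mapping Arabic-Indic digits to ASCII digits) and the name `normalize_arabic` are assumptions made for illustration.

```python
import re

# Assumption: fold these Alif variants to bare Alif (U+0627).
# U+0622 Alif with madda, U+0623 Alif with hamza above,
# U+0625 Alif with hamza below, U+0671 Alif wasla.
ALIF_MAP = {0x0622: 0x0627, 0x0623: 0x0627, 0x0625: 0x0627, 0x0671: 0x0627}

# Assumption: diacritics are removed. Covers U+064B..U+0652
# (fathatan through sukun) and U+0670 (superscript Alif).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Assumption: Arabic-Indic digits U+0660..U+0669 map to ASCII 0..9.
DIGIT_MAP = {0x0660 + i: ord("0") + i for i in range(10)}


def normalize_arabic(text: str) -> str:
    """Apply the three illustrative normalization steps in order."""
    text = text.translate(ALIF_MAP)
    text = DIACRITICS.sub("", text)
    return text.translate(DIGIT_MAP)


if __name__ == "__main__":
    # A vocalized word starting with Alif-hamza, followed by Arabic-Indic digits.
    sample = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F \u0662\u0660\u0662\u0664"
    print(normalize_arabic(sample))
```

Collapsing such variants before training means that spellings differing only in these marks share the same vocabulary entries, which is one plausible reason normalization would lower fertility.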
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
