AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

Mark Kashirskiy; Artiom Lipinski; Ilya Makarov

arXiv:2512.18399·cs.CL·December 23, 2025

TL;DR

This paper introduces AraToken, an Arabic-specific SentencePiece Unigram tokenizer with a normalization pipeline, and LEP, a method for integrating it into Qwen3. Normalization lowers fertility by 18% (1.199 vs 1.35 tokens/word), and LEP reduces evaluation loss from 8.28 to 2.43.

Contribution

The paper presents an Arabic tokenizer with a normalization pipeline and a method for extending Qwen3 with that tokenizer, aimed at more efficient Arabic text processing.

Findings

01. 18% lower fertility with the normalized tokenizer

02. LEP reduces evaluation loss from 8.28 to 2.43

03. Improved tokenization efficiency for Arabic
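
Fertility, the metric behind the first finding, is reported in tokens per word. A minimal sketch of how such a figure could be computed follows. It assumes words are delimited by whitespace, which the page does not specify, and the `fertility` and `toy_tokenize` names are hypothetical helpers, not part of AraToken.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word over a corpus."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        # Assumption: a "word" is any whitespace-separated chunk.
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: cuts each word into pieces of at most 3 characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return pieces


if __name__ == "__main__":
    # Two words, three tokens in total.
    print(fertility(["abcd ef"], toy_tokenize))
```

A real measurement would pass the encode function of the tokenizer under test in place of `toy_tokenize`. Lower values mean shorter token sequences for the same text.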

Abstract

Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages exhibit suboptimal performance on morphologically rich languages such as Arabic, resulting in inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on the SentencePiece Unigram algorithm with a comprehensive normalization pipeline addressing Arabic-specific orthographic variations, including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility (1.199 vs 1.35 tokens/word) compared to unnormalized…
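
The three orthographic variations named in the abstract can be illustrated with a short normalization sketch. The exact rules AraToken applies are not given here, so everything below is an assumption: Alif variants are folded to bare Alif, diacritics are removed, and Arabic-Indic digits are mapped to ASCII digits. The function name `normalize_arabic` is hypothetical.

```python
import re

# Assumption: hamza, madda, and wasla forms of Alif fold to bare Alif (U+0627).
_ALIF_MAP = {
    0x0622: "\u0627",  # ALEF WITH MADDA ABOVE
    0x0623: "\u0627",  # ALEF WITH HAMZA ABOVE
    0x0625: "\u0627",  # ALEF WITH HAMZA BELOW
    0x0671: "\u0627",  # ALEF WASLA
}

# Assumption: Arabic-Indic digits (U+0660..U+0669) map to ASCII 0-9.
_DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}

# Single translation table covering both character-level substitutions.
_CHAR_MAP = {**_ALIF_MAP, **_DIGIT_MAP}

# Assumption: diacritics are dropped. Covers fathatan..sukun (U+064B..U+0652)
# and the superscript Alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def normalize_arabic(text: str) -> str:
    """Apply diacritic removal, Alif folding, and digit mapping."""
    text = _DIACRITICS.sub("", text)
    return text.translate(_CHAR_MAP)


if __name__ == "__main__":
    # Alef-with-hamza + fatha + three letters, then Arabic-Indic digits 1 and 2.
    sample = "\u0623\u064E\u062D\u0645\u062F \u0661\u0662"
    print(normalize_arabic(sample))
```

Collapsing these variants before training means that spellings differing only in hamza placement or vowel marks share vocabulary entries, which is one plausible reason a normalized tokenizer would need fewer tokens per word.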

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification