Separate Before You Compress: The WWHO Tokenization Architecture
Kusal Darshana

TL;DR
The paper introduces WWHO, a three-layer tokenization architecture, and SGPE, an accompanying algorithm. Together they reduce token counts for complex Abugida scripts (by up to 61.7% for Sinhala), improving multilingual LLM efficiency while keeping grapheme clusters intact.
Contribution
It presents a new three-layer architecture and an algorithm that separates linguistic rules from statistical encoding, enabling effective multilingual tokenization for complex scripts.
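One way to picture "separating linguistic rules from statistical encoding" is a two-stage pipeline: a rule-based pass first groups codepoints into indivisible units, and only then does pair counting run over those units. The sketch below is not the paper's SGPE algorithm. It is a simplified illustration that assumes a cluster is a base character plus any following combining marks, extended across a zero-width joiner; full Unicode grapheme segmentation has more rules. The function names `split_clusters` and `count_adjacent_pairs` are hypothetical.

```python
import unicodedata
from collections import Counter

ZWJ = "\u200D"  # zero-width joiner, used in Sinhala conjuncts


def split_clusters(text):
    """Rule-based stage (simplified assumption, not the paper's rules).

    A character is attached to the previous cluster when it is a
    combining mark (Unicode category M*), when it is a ZWJ, or when
    the previous character was a ZWJ.
    """
    clusters = []
    for ch in text:
        attach = bool(clusters) and (
            unicodedata.category(ch).startswith("M")
            or ch == ZWJ
            or clusters[-1][-1] == ZWJ
        )
        if attach:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def count_adjacent_pairs(units):
    """Statistical stage: count adjacent unit pairs, as BPE-style merging does.

    Because the input units are whole clusters, any merge chosen from
    these counts can only join clusters, never split one.
    """
    return Counter(zip(units, units[1:]))


# Sinhala "śrī" followed by "laṅkā", written with escapes for clarity.
sample = "\u0DC1\u0DCA\u200D\u0DBB\u0DD3" + "\u0DBD\u0D82\u0D9A\u0DCF"
units = split_clusters(sample)
pairs = count_adjacent_pairs(units)
```

In this sketch the statistical stage never sees individual codepoints, which is the property a zero-breakage guarantee depends on.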
Findings
Achieves up to 61.7% token reduction for Sinhala
Extends context window by up to 4.38 times for Abugida languages
Maintains linguistic zero-breakage guarantee
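The two headline figures can be related with simple arithmetic, under the assumption (not stated in this summary) that the effective context extension equals the reciprocal of the remaining token fraction. The helper names below are hypothetical.

```python
def context_multiplier(token_reduction):
    """Effective context gain if the same text needs (1 - reduction) as many tokens."""
    return 1.0 / (1.0 - token_reduction)


def implied_reduction(multiplier):
    """Token reduction that would yield the given context multiplier."""
    return 1.0 - 1.0 / multiplier


# A 61.7% reduction corresponds to roughly a 2.61x multiplier.
sinhala_multiplier = context_multiplier(0.617)

# A 4.38x multiplier corresponds to roughly a 77.2% reduction.
reduction_for_4_38x = implied_reduction(4.38)
```

Under that assumption, the 61.7% and 4.38x figures do not describe the same comparison. They presumably come from different languages or baselines, which this summary does not specify.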
Abstract
Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers, which are very effective for structurally simple Latin-script languages such as English. However, standard BPE tokenizers struggle with complex Abugida scripts. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM's reasoning efficiency by forcing it to learn basic orthographic structures at inference time, and it raises inference costs, resulting in a significant "Token Tax" for the Global South. We propose a new three-layer architecture, WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless…
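To make the "multi-codepoint grapheme cluster" point concrete, the short sketch below inspects one Sinhala conjunct, the syllable "śrī", which a reader sees as a single written unit. The choice of example is ours, not the paper's. It uses only the Python standard library.

```python
import unicodedata

# One visible unit: consonant + virama + zero-width joiner + consonant + vowel sign.
conjunct = "\u0DC1\u0DCA\u200D\u0DBB\u0DD3"

# List each codepoint with its Unicode name.
for ch in conjunct:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))

codepoint_count = len(conjunct)             # 5 codepoints
byte_count = len(conjunct.encode("utf-8"))  # 15 bytes, 3 per codepoint
```

A byte-level tokenizer with few or no learned merges for this script could spend many tokens on this one unit, and any split inside it yields pieces with no standalone meaning.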
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy
