Separate Before You Compress: The WWHO Tokenization Architecture

Kusal Darshana

arXiv:2603.25309·cs.CL·March 27, 2026

Open Access

TL;DR

The paper introduces WWHO, a novel tokenization architecture and SGPE algorithm that significantly reduces token counts for complex Abugida scripts, improving multilingual LLM efficiency and maintaining linguistic integrity.

Contribution

It presents a new three-layer architecture and an algorithm that separates linguistic rules from statistical encoding, enabling effective multilingual tokenization for complex scripts.

Findings

01

Achieves up to 61.7% token reduction for Sinhala

02

Extends context window by up to 4.38 times for Abugida languages

03

Maintains linguistic zero-breakage guarantee
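The two headline numbers can be related with simple arithmetic. The sketch below assumes a fixed token budget and that "context window extension" means how much more text fits in that budget; the function names are illustrative and do not come from the paper. If a tokenizer needs a fraction r fewer tokens for the same text, the same budget covers 1 / (1 - r) times as much text.

```python
def context_multiplier(token_reduction: float) -> float:
    """How much more text fits in a fixed token budget when the
    tokenizer emits `token_reduction` (a fraction in [0, 1)) fewer tokens."""
    if not 0.0 <= token_reduction < 1.0:
        raise ValueError("token_reduction must be in [0, 1)")
    return 1.0 / (1.0 - token_reduction)


def reduction_for_multiplier(multiplier: float) -> float:
    """Inverse relation: the token reduction implied by a context multiplier."""
    if multiplier < 1.0:
        raise ValueError("multiplier must be >= 1")
    return 1.0 - 1.0 / multiplier


if __name__ == "__main__":
    # 61.7% fewer tokens corresponds to roughly 2.61x more text per budget.
    print(round(context_multiplier(0.617), 2))
    # A 4.38x extension corresponds to roughly 77.2% fewer tokens.
    print(round(reduction_for_multiplier(4.38), 3))
```

Under this assumption, the 61.7% reduction corresponds to about 2.61 times the context, while the 4.38 times figure would correspond to a reduction of about 77.2%. The two findings therefore presumably refer to different languages or baseline tokenizers; this page does not say which.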

Abstract

Current Large Language Models (LLMs) mostly use tokenizers based on Byte Pair Encoding (BPE), which are very effective for structurally simple Latin-script languages such as English. However, standard BPE tokenizers struggle with complex Abugida scripts: they break conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM's reasoning efficiency by forcing it to learn basic orthographic structures at inference time and raises inference costs, resulting in a significant "Token Tax" for the Global South. We propose a new three-layer architecture, WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless…
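To make the "separate before you compress" idea concrete, here is a minimal sketch of grouping code points into grapheme clusters before any statistical merging. It is not the paper's SGPE algorithm. The rule used (attach combining marks, the zero-width joiner, and whatever follows a joiner to the current cluster) is a simplifying assumption, and `grapheme_clusters` is a hypothetical helper name. The sample string is the Sinhala text for "Sri Lanka", written with escapes so the code points are visible.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner, used when forming Sinhala conjuncts


def grapheme_clusters(text: str) -> list[str]:
    """Group code points into approximate grapheme clusters.

    Simplified rule (an assumption, not full Unicode segmentation):
    a character joins the current cluster if it is a combining mark,
    is a ZWJ, or directly follows a ZWJ.
    """
    clusters: list[str] = []
    for ch in text:
        attaches = bool(clusters) and (
            unicodedata.category(ch).startswith("M")
            or ch == ZWJ
            or clusters[-1].endswith(ZWJ)
        )
        if attaches:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "Sri Lanka" in Sinhala; the first word is one conjunct of 5 code points.
sample = "\u0dc1\u0dca\u200d\u0dbb\u0dd3 \u0dbd\u0d82\u0d9a\u0dcf"
clusters = grapheme_clusters(sample)

if __name__ == "__main__":
    # 28 UTF-8 bytes, 10 code points, 4 clusters
    print(len(sample.encode("utf-8")), len(sample), len(clusters))
```

A byte-level tokenizer starts from 28 units for this string, while a cluster-level one starts from 4, and any pair merges learned on top of clusters cannot split a conjunct. That is the intuition behind keeping script rules apart from the compression step.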

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy