BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
Sander Land, Catherine Arnett

TL;DR
This paper introduces SCRIPT, a new pretokenization scheme that uses Unicode script and category properties to improve multilingual BPE tokenization, enhancing robustness and fairness across diverse scripts.
Contribution
We propose SCRIPT, a rule-based pretokenization method that avoids UTF-8 issues and improves multilingual BPE by respecting script boundaries and enforcing character integrity.
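As a rough illustration of what rule-based pretokenization on script boundaries can look like, here is a minimal Python sketch. It is not the paper's rule set. The helper names `script_proxy` and `pretokenize` are hypothetical, and because Python's standard library does not expose the Unicode Script property, the sketch approximates a character's script by the first word of its Unicode name, which is only a crude stand-in.

```python
import unicodedata


def script_proxy(ch: str) -> str:
    """Rough script label: the first word of the Unicode character name.

    ASSUMPTION: this stands in for the real Unicode Script property,
    which the standard library does not provide.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def group_key(ch: str) -> tuple:
    """Letters are grouped by script; everything else by major category."""
    major = unicodedata.category(ch)[0]  # 'L', 'M', 'N', 'P', 'S', 'Z', 'C'
    if major in ("L", "M"):
        return ("letter", script_proxy(ch))
    return (major, "")


def pretokenize(text: str) -> list:
    """Split text wherever the (category, script) key changes."""
    pieces = []
    prev_key = None
    for ch in text:
        # Combining marks stay attached to the preceding group.
        if unicodedata.category(ch)[0] == "M" and pieces:
            pieces[-1] += ch
            continue
        key = group_key(ch)
        if key == prev_key:
            pieces[-1] += ch
        else:
            pieces.append(ch)
            prev_key = key
    return pieces


print(pretokenize("Hello мир 123!"))
```

The point of the sketch is that the split decision depends only on per-character Unicode properties, with no regular expression involved. Concatenating the returned pieces always reproduces the input.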
Findings
SCRIPT-BPE achieves competitive compression rates.
It eliminates encoding-based penalties for non-Latin scripts.
It provides a robust alternative to regex-based pretokenization.
Abstract
Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization, often reliant on complex regular expressions, can also introduce fragility and unexpected edge cases. We propose SCRIPT (Script Category Representation in PreTokenization), a novel encoding scheme that bypasses UTF-8 byte conversion by using initial tokens based on Unicode script and category properties. This approach enables a simple, rule-based pretokenization strategy that respects script boundaries, offering a robust alternative to pretokenization strategies based on regular expressions. We also introduce and validate a constrained BPE merging strategy that enforces character integrity, applicable to both SCRIPT-BPE and byte-based BPE. Our…
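To make the idea of a character-integrity constraint concrete for byte-based BPE, the following Python sketch shows one plausible reading: a merge is accepted only if the resulting token is either a sequence of complete UTF-8 characters or a fragment that lies entirely within a single character. This is an assumption made for illustration, not the rule as specified in the paper, and the function names (`is_whole_chars`, `within_one_char`, `merge_allowed`) are hypothetical.

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return byte & 0xC0 == 0x80


def is_whole_chars(token: bytes) -> bool:
    """True if the token decodes as one or more complete characters."""
    try:
        token.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def within_one_char(token: bytes) -> bool:
    """True if the token could be a contiguous fragment of one character."""
    if not token or len(token) > 4:
        return False
    # Every byte after the first must be a continuation byte.
    if not all(is_continuation(b) for b in token[1:]):
        return False
    first = token[0]
    if is_continuation(first):
        return len(token) <= 3  # at most three continuation bytes per char
    if first < 0x80:
        return len(token) == 1  # ASCII is always a single byte
    if first >> 5 == 0b110:
        need = 2
    elif first >> 4 == 0b1110:
        need = 3
    elif first >> 3 == 0b11110:
        need = 4
    else:
        return False  # not a valid UTF-8 lead byte
    return len(token) <= need


def merge_allowed(left: bytes, right: bytes) -> bool:
    """ASSUMED constraint: reject merges that straddle a character boundary."""
    merged = left + right
    return is_whole_chars(merged) or within_one_char(merged)


print(merge_allowed(b"\xc3", b"\xa9"))  # the two bytes of "é"
print(merge_allowed(b"a", b"\xc3"))     # "a" plus the first byte of "é"
```

Under this reading, a BPE trainer would call `merge_allowed` on each candidate pair before adding it to the merge table, so no learned token mixes a complete character with part of another.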
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Byte Pair Encoding
