BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization

Sander Land; Catherine Arnett

arXiv:2505.24689 · cs.CL · June 2, 2025

TL;DR

This paper introduces SCRIPT, a new pretokenization scheme that uses Unicode script and category properties to improve multilingual BPE tokenization, enhancing robustness and fairness across diverse scripts.

Contribution

We propose SCRIPT, a rule-based pretokenization method that bypasses UTF-8 byte conversion and improves multilingual BPE by respecting script boundaries and enforcing character integrity.

Findings

01 SCRIPT-BPE achieves competitive compression rates.

02 SCRIPT-BPE eliminates encoding penalties for non-Latin scripts.

03 SCRIPT-BPE provides a robust alternative to regex-based pretokenization.
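
The encoding penalty mentioned in the findings can be illustrated with a small sketch. It assumes the penalty refers to the number of UTF-8 bytes a byte-level tokenizer sees per character before any merges; the sample characters and the helper name `utf8_len` are chosen here for illustration and do not come from the paper.

```python
# Illustrative only: count how many UTF-8 bytes one character occupies
# in several scripts. A byte-level BPE starts from these bytes, so
# scripts outside the ASCII range begin with longer sequences.
samples = {
    "Latin": "a",
    "Cyrillic": "я",
    "Devanagari": "क",
    "Han": "中",
}


def utf8_len(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of ch."""
    return len(ch.encode("utf-8"))


byte_counts = {script: utf8_len(ch) for script, ch in samples.items()}

for script, n in byte_counts.items():
    print(f"{script}: {n} byte(s) per character")
```

An encoding whose initial tokens are derived from Unicode properties rather than bytes would not start from these unequal lengths, which is one plausible reading of the second finding.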

Abstract

Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization, often reliant on complex regular expressions, can also introduce fragility and unexpected edge cases. We propose SCRIPT (Script Category Representation in PreTokenization), a novel encoding scheme that bypasses UTF-8 byte conversion by using initial tokens based on Unicode script and category properties. This approach enables a simple, rule-based pretokenization strategy that respects script boundaries, offering a robust alternative to pretokenization strategies based on regular expressions. We also introduce and validate a constrained BPE merging strategy that enforces character integrity, applicable to both SCRIPT-BPE and byte-based BPE. Our…
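
As a rough illustration of rule-based pretokenization on script and category boundaries, the sketch below splits text wherever the (script, category) pair changes. It is not the authors' implementation. Python's standard library does not expose the Unicode Script property, so the hypothetical helper `script_of` approximates it with the first word of the character name, and `supercategory` and `pretokenize` are names invented for this example.

```python
import unicodedata
from itertools import groupby


def script_of(ch: str) -> str:
    """Hypothetical stand-in for the Unicode Script property.

    Assumption: the first word of the character name (e.g. 'LATIN',
    'CYRILLIC', 'CJK') is a usable approximation for a demo.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def supercategory(ch: str) -> str:
    """First letter of the general category: L, N, P, Z, and so on."""
    return unicodedata.category(ch)[0]


def pretokenize(text: str) -> list[str]:
    """Split text wherever the (script, supercategory) pair changes."""
    def key(ch: str) -> tuple[str, str]:
        return (script_of(ch), supercategory(ch))

    return ["".join(group) for _, group in groupby(text, key=key)]


pieces = pretokenize("Hello мир 123")
print(pieces)
```

Because the rule only compares adjacent characters, no regular expression is involved, and concatenating the pieces reproduces the input exactly. The name-based approximation splits combining marks from their base letters and handles whitespace naively, so a real implementation would rely on proper script data and additional rules.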

Code & Models

Repositories

sanderland/script_bpe (Official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Byte Pair Encoding