Frequency-Ordered Tokenization for Better Text Compression

Maximilian Kalcher

arXiv:2602.22958·cs.IT·February 27, 2026

TL;DR

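Frequency-ordered tokenization improves lossless text compression by giving frequent BPE tokens small integer identifiers and encoding them as variable-length integers before a standard compressor runs. On enwik8 it gains 7.08 pp with zlib, 1.69 pp with LZMA, and 0.76 pp with zstd; the gains hold on enwik9 and on Chinese and Arabic text, and preprocessing speeds up compression by up to 3.1x.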

Contribution

The paper introduces a simple preprocessing technique that reassigns BPE token identifiers in frequency order and encodes them as variable-length integers, improving both compression ratio and speed and outperforming the classical Word Replacing Transform.

Findings

01. 7.08 pp improvement for zlib on enwik8 (including vocabulary overhead)

02. Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text

03. Preprocessing accelerates compression by up to 3.1x

Abstract

We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with Byte Pair Encoding (BPE), reorders the vocabulary so that frequent tokens receive small integer identifiers, and encodes the result with variable-length integers before passing it to any standard compressor. On enwik8 (100 MB Wikipedia), this yields improvements of 7.08 percentage points (pp) for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd (all including vocabulary overhead), outperforming the classical Word Replacing Transform. Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text. We further show that preprocessing accelerates compression for computationally expensive algorithms: the total wall-clock time including…
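
The pipeline described in the abstract (tokenize, assign small identifiers to frequent tokens, write variable-length integers, hand the bytes to a standard compressor) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the regex `tokenize` function is a stand-in for BPE, the variable-length integer format is assumed to be LEB128-style (7 payload bits per byte), and all function names are hypothetical. The vocabulary is returned but not serialized, so the overhead the paper accounts for is not measured here.

```python
import re
import zlib
from collections import Counter


def tokenize(text):
    # Stand-in for BPE (assumption): words, whitespace runs, single symbols.
    # Every character matches one alternative, so "".join(tokens) == text.
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def build_vocab(tokens):
    # Most frequent token first; ties broken by token text for determinism.
    counts = Counter(tokens)
    return sorted(counts, key=lambda t: (-counts[t], t))


def encode_varint(n):
    # LEB128-style (assumed format): 7 data bits per byte,
    # high bit set on every byte except the last.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(data):
    ids, current, shift = [], 0, 0
    for b in data:
        current |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            ids.append(current)
            current, shift = 0, 0
    return ids


def preprocess(text):
    # Returns the frequency-ordered vocabulary and the varint-encoded ID stream.
    tokens = tokenize(text)
    vocab = build_vocab(tokens)
    token_to_id = {tok: i for i, tok in enumerate(vocab)}
    payload = b"".join(encode_varint(token_to_id[t]) for t in tokens)
    return vocab, payload


def restore(vocab, payload):
    return "".join(vocab[i] for i in decode_varints(payload))


sample = "the cat saw the dog and the dog saw the cat. " * 50
vocab, payload = preprocess(sample)
compressed = zlib.compress(payload, 9)      # any standard compressor fits here
recovered = restore(vocab, zlib.decompress(compressed))
print(recovered == sample, len(sample.encode()), len(payload), len(compressed))
```

Because identifiers below 128 fit in a single byte, the most frequent tokens in a Zipf-distributed text each cost one byte in the intermediate stream. A fair comparison against compressing the raw text would also have to count the bytes needed to store `vocab`.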

Taxonomy

Topics: Algorithms and Data Compression · Semigroups and Automata Theory · Natural Language Processing Techniques