Frequency-Ordered Tokenization for Better Text Compression
Maximilian Kalcher

TL;DR
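Frequency-ordered tokenization is a preprocessing step for lossless text compression: it tokenizes text with BPE, gives frequent tokens small integer IDs, and encodes them as variable-length integers before a standard compressor runs. This improves compression ratios for zlib, LZMA, and zstd across English, Chinese, and Arabic text, and can reduce total compression time.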
Contribution
The paper introduces a simple preprocessing technique that reorders the BPE vocabulary by token frequency, improving compression ratio and, for computationally expensive compressors, speed. It outperforms the classical Word Replacing Transform.
Findings
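7.08 pp improvement for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd on enwik8 (all including vocabulary overhead)
Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text
Preprocessing accelerates compression by up to 3.1x for computationally expensive compressors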
Abstract
We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with Byte Pair Encoding (BPE), reorders the vocabulary so that frequent tokens receive small integer identifiers, and encodes the result with variable-length integers before passing it to any standard compressor. On enwik8 (100 MB Wikipedia), this yields improvements of 7.08 percentage points (pp) for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd (all including vocabulary overhead), outperforming the classical Word Replacing Transform. Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text. We further show that preprocessing accelerates compression for computationally expensive algorithms: the total wall-clock time including…
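The pipeline described in the abstract (tokenize, reorder the vocabulary by frequency, encode IDs as variable-length integers, then compress) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the regex `tokenize` function is a hypothetical stand-in for BPE, the varint format is assumed to be LEB128-style (7 payload bits per byte), and all function names are invented for this example.

```python
import re
import zlib
from collections import Counter


def tokenize(text):
    # ASSUMPTION: stand-in for BPE. Splits into word runs, whitespace runs,
    # and single punctuation characters, so "".join(tokens) == text.
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def build_vocab(tokens):
    # Frequency ordering: the most frequent token gets ID 0, the next ID 1, ...
    # Ties are broken lexicographically so the mapping is deterministic.
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(ordered)}


def encode_varint(n):
    # ASSUMPTION: LEB128-style varint. IDs below 128 take one byte,
    # IDs below 16384 take two bytes, and so on.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(data):
    ids, n, shift = [], 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            ids.append(n)
            n, shift = 0, 0
    return ids


def preprocess(text):
    # Returns the varint byte stream and the vocabulary needed to invert it.
    tokens = tokenize(text)
    vocab = build_vocab(tokens)
    payload = b"".join(encode_varint(vocab[t]) for t in tokens)
    return payload, vocab


def restore(payload, vocab):
    id_to_tok = {i: t for t, i in vocab.items()}
    return "".join(id_to_tok[i] for i in decode_varints(payload))


text = "the cat and the dog and the bird"
payload, vocab = preprocess(text)
compressed = zlib.compress(payload, 9)  # any standard compressor fits here
assert restore(zlib.decompress(compressed), vocab) == text
```

Because frequent tokens map to small IDs, most of the stream consists of single-byte varints drawn from a small range of values, which is the skew a downstream compressor can exploit. The vocabulary has to be stored alongside the payload for decoding, which is why the reported gains are stated as including vocabulary overhead.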
Taxonomy
Topics: Algorithms and Data Compression · Semigroups and Automata Theory · Natural Language Processing Techniques
