Optimal alphabet for single text compression
Armen E. Allahverdyan, Andranik Khachatryan

TL;DR
This study compares letters, letter n-grams, syllables, words, and phrases as alphabets for Huffman text compression. When performance is measured by the full code length (encoded text plus codebook), syllables or words give the best compression for most texts.
Contribution
The paper systematically compares alphabet choices for Huffman compression, shows that the codebook representation affects which alphabet is best, and identifies syllables and words as the alphabets that minimize the full code length for most texts.
Findings
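Syllables or words minimize the full code length for the majority of Project Gutenberg texts, depending on how the codebook is represented; for sufficiently short texts, letters or letter 2-grams are optimal.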
Letter 3- and 4-grams perform worse than syllables or words.
Compact codebook representation improves compression, especially for large symbol sets.
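
To make the notion of "alphabet" concrete, here is a minimal sketch of how one text can be segmented into different symbol sets and how the number of distinct symbols (the codebook size) changes. The helper names `letter_ngrams`, `word_tokens`, and `alphabet_stats` are hypothetical, the use of non-overlapping n-gram blocks is an assumption rather than the paper's stated procedure, and syllables are omitted because they would need a language-specific hyphenation tool.

```python
import re

def letter_ngrams(text, n):
    """Split text into consecutive, non-overlapping blocks of n characters.
    (Assumption: the paper's exact n-gram segmentation may differ.)"""
    return [text[i:i + n] for i in range(0, len(text), n)]

def word_tokens(text):
    """Split text into words and the separators between them,
    so that joining the tokens reproduces the text exactly."""
    return re.findall(r"\w+|\W+", text)

def alphabet_stats(text):
    """Return {alphabet name: (number of tokens, number of distinct symbols)}."""
    candidates = {
        "letters": letter_ngrams(text, 1),
        "2-grams": letter_ngrams(text, 2),
        "3-grams": letter_ngrams(text, 3),
        "words": word_tokens(text),
    }
    return {name: (len(toks), len(set(toks))) for name, toks in candidates.items()}

if __name__ == "__main__":
    for name, (n_tokens, n_symbols) in alphabet_stats("the cat and the hat").items():
        print(f"{name:8s} tokens={n_tokens:3d} distinct={n_symbols:3d}")
```

Larger symbols shorten the token sequence but enlarge the set of distinct symbols that the codebook must list, which is the trade-off the findings describe.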
Abstract
A text written using symbols from a given alphabet can be compressed using the Huffman code, which minimizes the length of the encoded text. It is necessary, however, to employ a text-specific codebook, i.e., the symbol-codeword dictionary, to decode the original text. Thus, the compression performance should be evaluated by the full code length, i.e., the length of the encoded text plus the length of the codebook. We studied several alphabets for compressing texts: letters, n-grams of letters, syllables, words, and phrases. If only sufficiently short texts are retained, an alphabet of letters or two-grams of letters is optimal. For the majority of Project Gutenberg texts, the best alphabet (the one that minimizes the full code length) is given by syllables or words, depending on the representation of the codebook. Letter 3- and 4-grams, having on average comparable length to…
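
The full code length described in the abstract can be illustrated with a short sketch: build a Huffman code for a token sequence, then add the encoded length to a codebook length. The function names and the codebook cost model (each symbol spelled out at 8 bits per character, plus its codeword) are assumptions made for illustration; the paper considers its own codebook representations, which are not reproduced here.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: codeword length in bits} for a Huffman code."""
    if len(freqs) == 1:
        # A single-symbol alphabet still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in d1.items()}
        merged.update({s: depth + 1 for s, depth in d2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def full_code_length(symbols, bits_per_char=8):
    """Return (encoded bits, codebook bits, total bits) for a token sequence.
    Assumed naive codebook: symbol text at bits_per_char per character,
    followed by its codeword."""
    freqs = Counter(symbols)
    lengths = huffman_code_lengths(freqs)
    encoded_bits = sum(freqs[s] * lengths[s] for s in freqs)
    codebook_bits = sum(len(s) * bits_per_char + lengths[s] for s in freqs)
    return encoded_bits, codebook_bits, encoded_bits + codebook_bits

if __name__ == "__main__":
    print(full_code_length(list("aaaabbc")))
```

For a very short input such as this one, the codebook dominates the total, which is consistent with the abstract's remark that small alphabets are preferable for short texts.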
Taxonomy
Topics: Algorithms and Data Compression · Advanced Data Compression Techniques · Artificial Intelligence in Games
