Universal Lossless Compression with Unknown Alphabets - The Average Case
Gil I. Shamir

TL;DR
This paper investigates universal lossless compression of sequence patterns generated by i.i.d. sources with unknown and possibly large alphabets, providing bounds on redundancy and proposing low-complexity algorithms.
Contribution
It introduces new bounds on redundancy for pattern compression with unknown alphabets and presents two practical algorithms for this task.
Findings
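Redundancy bounds depend on the alphabet size k and the sequence length n: for essentially small alphabets, pattern redundancy is at least 0.5 log(n/k^3) bits per unknown probability parameter.
There exist codes whose redundancy per symbol decreases as the sequence length increases.
For large enough alphabets, the average universal description length of a pattern can fall below the entropy of the underlying i.i.d. source.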
Abstract
Universal compression of patterns of sequences generated by independently identically distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alphabet symbols can be exploited to create the pattern of the sequence. This pattern can in turn be compressed by itself. It is shown that if the alphabet size is essentially small, then the average minimax and maximin redundancies as well as the redundancy of every code for almost every source, when compressing a pattern, consist of at least 0.5 log(n/k^3) bits per each unknown probability parameter, and if all alphabet letters are likely to occur, there exist codes whose redundancy is at most…
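To make the definition of a pattern concrete, the following is a minimal Python sketch. It is not taken from the paper. The function names `pattern` and `per_parameter_lower_bound` are hypothetical, the logarithm is assumed to be base 2 because the bound is stated in bits, and n and k are assumed to denote the sequence length and the alphabet size.

```python
import math
from typing import Hashable, List, Sequence


def pattern(sequence: Sequence[Hashable]) -> List[int]:
    """Replace each symbol by the index of its first occurrence.

    The first distinct symbol gets index 1, the second distinct symbol
    gets index 2, and so on, so the result does not depend on the
    actual alphabet letters.
    """
    first_seen = {}  # symbol -> index assigned at its first occurrence
    result = []
    for symbol in sequence:
        if symbol not in first_seen:
            first_seen[symbol] = len(first_seen) + 1
        result.append(first_seen[symbol])
    return result


def per_parameter_lower_bound(n: int, k: int) -> float:
    """Evaluate 0.5 * log2(n / k^3), the per-parameter bound quoted in
    the abstract for essentially small alphabets.

    Assumption: base-2 logarithm, n = sequence length, k = alphabet size.
    """
    return 0.5 * math.log2(n / k**3)


if __name__ == "__main__":
    print(pattern("abracadabra"))  # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
    print(per_parameter_lower_bound(n=10**6, k=10))
```

Any two sequences that differ only by a relabeling of their symbols map to the same pattern, for example `pattern("abracadabra") == pattern("xyzxwxvxyzx")`, which is why the pattern can be coded separately from the unknown alphabet.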
