Universal Lossless Compression with Unknown Alphabets - The Average Case

Gil I. Shamir

arXiv:cs/0603068·cs.IT·November 17, 2016

TL;DR

This paper investigates universal lossless compression of sequence patterns generated by i.i.d. sources with unknown and possibly large alphabets, providing bounds on redundancy and proposing low-complexity algorithms.

Contribution

It introduces new bounds on redundancy for pattern compression with unknown alphabets and presents two practical algorithms for this task.

Findings

01

Redundancy bounds depend on the alphabet size $k$ and the sequence length $n$.

02

Codes exist whose per-symbol redundancy decreases as the sequence length increases.

03

For sufficiently large alphabets, the average description length of a pattern can fall below the entropy of the underlying i.i.d. source.

Abstract

Universal compression of patterns of sequences generated by independently identically distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alphabet symbols can be exploited to create the pattern of the sequence. This pattern can in turn be compressed by itself. It is shown that if the alphabet size $k$ is essentially small, then the average minimax and maximin redundancies as well as the redundancy of every code for almost every source, when compressing a pattern, consist of at least $0.5 \log(n/k^3)$ bits per each unknown probability parameter, and if all alphabet letters are likely to occur, there exist codes whose redundancy is at most…
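
To make the definition of a pattern concrete, here is a minimal Python sketch that maps a sequence to its pattern by replacing each symbol with the index of its first occurrence. The function name `pattern` and the choice to start indices at 1 are assumptions made for illustration; they are not taken from the paper's algorithms, and the sketch performs no compression.

```python
def pattern(sequence):
    """Return the pattern of `sequence` as a list of 1-based indices.

    Each distinct symbol is assigned the next unused index the first
    time it appears, so the indices 1, 2, 3, ... show up in increasing
    order of first occurrence.
    """
    first_seen = {}   # symbol -> index assigned at its first occurrence
    indices = []
    for symbol in sequence:
        if symbol not in first_seen:
            # New symbol: give it the next consecutive index.
            first_seen[symbol] = len(first_seen) + 1
        indices.append(first_seen[symbol])
    return indices


if __name__ == "__main__":
    print(pattern("lossless"))  # [1, 2, 3, 3, 1, 4, 3, 3]
```

The pattern discards the identities of the symbols and keeps only their order of appearance, so any two sequences that differ by a relabeling of the alphabet share the same pattern.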
