Optimal compression of hash-origin prefix trees

Jarek Duda

arXiv:1206.4555 · cs.IT · July 10, 2012

TL;DR

This paper analyzes the informational limits of hash-origin prefix trees, proposing optimal compression methods that significantly reduce memory usage compared to standard approaches and Bloom filters.

Contribution

It derives the asymptotic minimum number of bits per element needed by a prefix tree and relates this to the optimal encoding of n large unordered numbers, improving memory efficiency.
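To illustrate the link to encoding unordered numbers: a set of n distinct values can be written down in any of n! orders, so discarding the order saves lg(n!) bits in total. The following is a minimal sketch (the function names are illustrative and do not come from the paper) that compares the exact per-element saving lg(n!)/n with the leading Stirling terms lg(n) - lg(e).

```python
import math

def order_saving_bits(n: int) -> float:
    """Exact lg(n!), computed via the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2)

def stirling_per_element(n: int) -> float:
    """Leading Stirling terms of lg(n!)/n, i.e. lg(n) - lg(e)."""
    return math.log2(n) - math.log2(math.e)

for n in (10**3, 10**6, 10**9):
    exact = order_saving_bits(n) / n
    print(n, round(exact, 5), round(stirling_per_element(n), 5))
```

The gap between the two columns shrinks as n grows, so for large sets the saving is close to lg(n) - 1.4427 bits per element.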

Findings

01

A minimal prefix tree that distinguishes the elements requires asymptotically about 2.77544 bits per element.

02

When one is certain of working inside the database, the cost of distinguishability can be reduced to about 2.33275 bits per element.

03

Memory requirements can be reduced to about 0.693 of the size of a Bloom filter.
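For context on the 0.693 figure: a standard Bloom filter tuned with the optimal number of hash functions needs lg(1/eps)/ln(2) bits per element for a false positive rate eps, whereas the asymptotic information-theoretic minimum for approximate membership is lg(1/eps) bits, a ratio of ln(2) ≈ 0.693. The sketch below only evaluates these textbook formulas. That the paper's 0.693 arises in exactly this way is an assumption here, and the function names are illustrative.

```python
import math

def bloom_bits_per_element(eps: float) -> float:
    """Bits per element of an optimally tuned standard Bloom filter."""
    return math.log2(1 / eps) / math.log(2)

def lower_bound_bits_per_element(eps: float) -> float:
    """Asymptotic information-theoretic minimum for false positive rate eps."""
    return math.log2(1 / eps)

for eps in (0.1, 0.01, 0.001):
    ratio = lower_bound_bits_per_element(eps) / bloom_bits_per_element(eps)
    print(eps, round(bloom_bits_per_element(eps), 3), round(ratio, 3))
```

The ratio is independent of eps, which is why a single constant can summarize the saving.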

Abstract

Operating on hash values of the elements of a database is a common problem. This paper analyzes the informational content of this general task and how to practically approach the resulting lower bounds. A minimal prefix tree that distinguishes the elements turns out to require asymptotically only about 2.77544 bits per element, while standard approaches use a few times more. When one is certain of working inside the database, the cost of distinguishability can be reduced further to about 2.33275 bits per element. Increasing the minimal depth of nodes to reduce the probability of false positives leads to a simple relation with the average depth of such a random tree, which is asymptotically larger by about 1.33275 bits than the lg(n) of the perfect binary tree. This asymptotic case can also be seen as a way to optimally encode n large unordered numbers, saving lg(n!) bits of information…
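The average-depth claim can be checked empirically. The sketch below assumes the "random tree" is a binary prefix tree over uniformly random hash values in which each leaf sits one bit past its longest shared prefix with any other element; this model and all function names are assumptions made for illustration, not taken from the paper.

```python
import math
import random

def lcp_bits(a: int, b: int, width: int) -> int:
    """Length of the common bit prefix of two distinct width-bit integers."""
    return width - (a ^ b).bit_length()

def average_leaf_depth(n: int, width: int = 64, seed: int = 0) -> float:
    """Average leaf depth of the minimal binary prefix tree of n random hashes."""
    rng = random.Random(seed)
    hashes = sorted({rng.getrandbits(width) for _ in range(n)})
    total = 0
    for i, h in enumerate(hashes):
        # The longest shared prefix is always with a neighbor in sorted order.
        left = lcp_bits(h, hashes[i - 1], width) if i > 0 else 0
        right = lcp_bits(h, hashes[i + 1], width) if i + 1 < len(hashes) else 0
        # One extra bit past the longest shared prefix makes the element unique.
        total += max(left, right) + 1
    return total / len(hashes)

n = 1 << 14
# Excess of the average depth over lg(n); the abstract quotes about 1.33275.
print(average_leaf_depth(n) - math.log2(n))
```

For a single random sample the printed excess fluctuates slightly around the asymptotic value, so it should be read as a sanity check and not as a reproduction of the paper's constant.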

Taxonomy

Topics: Algorithms and Data Compression · Caching and Content Delivery · DNA and Biological Computing