Perfect Hashing for Data Management Applications
Fabiano C. Botelho, Rasmus Pagh, Nivio Ziviani

TL;DR
This paper introduces a perfect hashing scheme that is theoretically well understood yet practical for large key sets, reporting an order-of-magnitude performance improvement over previous methods and space usage of slightly more than 3 bits per key.
Contribution
It presents a new perfect hashing method that pairs a well-understood theoretical analysis with practical efficiency, enabling scalable construction on large key sets.
Findings
Constructed minimal perfect hash functions for over one billion URLs in just over 1 hour.
Achieved space usage of slightly more than 3 bits per key, within a small constant factor of the information-theoretic lower bound of about 1.44 bits per key for minimal perfect hashing.
Demonstrated scalability and efficiency on commodity hardware.
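To put the space figure in concrete terms, the arithmetic can be sketched as follows. This is only a back-of-the-envelope illustration: the key count of exactly 10^9 and the rate of exactly 3.0 bits per key are rounded assumptions (the findings say "over one billion" and "slightly more than 3"), and the helper name `mphf_space_bytes` is hypothetical.

```python
import math

# Information-theoretic lower bound for a minimal perfect hash function,
# in bits per key, for large key sets: log2(e).
LOWER_BOUND_BITS = math.log2(math.e)


def mphf_space_bytes(num_keys: int, bits_per_key: float) -> float:
    """Return the size in bytes of a function description using bits_per_key."""
    return num_keys * bits_per_key / 8


# Assumed round numbers, not the paper's exact measurements.
num_keys = 1_000_000_000
reported = mphf_space_bytes(num_keys, 3.0)
bound = mphf_space_bytes(num_keys, LOWER_BOUND_BITS)

print(f"at 3 bits/key:      {reported / 1e6:.0f} MB")
print(f"at the lower bound: {bound / 1e6:.0f} MB")
```

Under these assumptions the function description for a billion keys fits in a few hundred megabytes, which is why such a structure can stay in main memory on commodity hardware.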
Abstract
Perfect hash functions can potentially be used to compress data in connection with a variety of data management tasks. Though there has been considerable work on how to construct good perfect hash functions, there is a gap between theory and practice among all previous methods on minimal perfect hashing. On one side, there are good theoretical results without experimentally proven practicality for large key sets. On the other side, there are the theoretically analyzed time and space usage algorithms that assume that truly random hash functions are available for free, which is an unrealistic assumption. In this paper we attempt to bridge this gap between theory and practice, using a number of techniques from the literature to obtain a novel scheme that is theoretically well-understood and at the same time achieves an order-of-magnitude increase in performance compared to previous…
Taxonomy
Topics: Algorithms and Data Compression · Advanced Image and Video Retrieval Techniques · Caching and Content Delivery
