AutoCSF: Provably Space-Efficient Indexing of Skewed Key-Value Workloads via Filter-Augmented Compressed Static Functions
David Torres Ramos, Vihan Lakshman, Chen Luo, Todd Treangen, Benjamin Coleman

TL;DR
AutoCSF introduces a formal, space-efficient indexing method for skewed key-value datasets by combining compressed static functions with pre-filters, improving accuracy and space usage with provable guarantees.
Contribution
It provides a rigorous algorithm for integrating CSFs with pre-filters, offering theoretical guarantees and a general framework beyond Bloom filters.
Findings
Achieves space savings over baseline methods
Maintains low query latency
Provides formal guarantees on space efficiency
Abstract
We study the problem of building space-efficient, in-memory indexes for massive key-value datasets with highly skewed value distributions. This challenge arises in many data-intensive domains and is particularly acute in computational genomics, where k-mer count tables can contain billions of entries dominated by a single frequent value. While recent work has proposed to address this problem by augmenting compressed static functions (CSFs) with pre-filters, existing approaches rely on complex heuristics and lack formal guarantees. In this paper, we introduce a principled algorithm, called AutoCSF, for combining CSFs with pre-filtering to provably handle skewed distributions with near-optimal space usage. We improve upon prior CSF pre-filtering constructions by (1) deriving a mathematically rigorous decision criterion for when filter augmentation is beneficial; (2) presenting a general…
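To make the idea of filter augmentation concrete, here is a minimal Python sketch of the general pre-filtering pattern the abstract describes. It is not the AutoCSF algorithm: the class names `BloomFilter` and `FilteredStaticMap`, the parameters `bits_per_key` and `num_hashes`, and the use of a plain `dict` as a stand-in for a real compressed static function are all assumptions made for illustration. Keys whose value differs from the dominant value are inserted into a Bloom filter; a query that misses the filter returns the dominant value immediately, and only keys that pass the filter (including false positives) need an entry in the fallback structure.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Small Bloom filter over string keys (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted BLAKE2b digests.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                key.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class FilteredStaticMap:
    """Pre-filter plus fallback map; the dict stands in for a real CSF."""

    def __init__(self, table, bits_per_key=10, num_hashes=7):
        # The most frequent value is never stored per key.
        self.dominant = Counter(table.values()).most_common(1)[0][0]
        minority = [k for k, v in table.items() if v != self.dominant]
        self.filter = BloomFilter(max(8, bits_per_key * len(minority)), num_hashes)
        for key in minority:
            self.filter.add(key)
        # Store every key that passes the filter, so false positives
        # (dominant-valued keys the filter wrongly accepts) stay correct.
        self.fallback = {k: v for k, v in table.items() if k in self.filter}

    def query(self, key):
        # Only defined for keys of the original table, as with a static function.
        if key not in self.filter:
            return self.dominant
        return self.fallback[key]


# Hypothetical skewed table: 95% of keys share the value 1.
table = {f"kmer{i}": (1 if i % 20 else 2 + i % 3) for i in range(1000)}
index = FilteredStaticMap(table)
print(len(index.fallback), "of", len(table), "keys stored in the fallback")
```

Whether the filter pays for itself depends on how skewed the values are and on the filter's false-positive rate; the sketch always builds the filter and does not attempt the decision criterion the abstract refers to.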
Taxonomy
Topics: Data Quality and Management · Data Management and Algorithms · Algorithms and Data Compression
