AutoCSF: Provably Space-Efficient Indexing of Skewed Key-Value Workloads via Filter-Augmented Compressed Static Functions

David Torres Ramos; Vihan Lakshman; Chen Luo; Todd Treangen; Benjamin Coleman

arXiv:2603.24882·cs.DS·March 27, 2026

TL;DR

AutoCSF is a principled, space-efficient indexing method for key-value datasets with skewed value distributions. It combines compressed static functions (CSFs) with pre-filters and comes with provable guarantees on space usage.

Contribution

The paper provides a rigorous algorithm for integrating CSFs with pre-filters, with theoretical guarantees and a general framework that extends beyond Bloom filters.

Findings

01. Achieves space savings over baseline methods
02. Maintains low query latency
03. Provides formal guarantees on space efficiency

Abstract

We study the problem of building space-efficient, in-memory indexes for massive key-value datasets with highly skewed value distributions. This challenge arises in many data-intensive domains and is particularly acute in computational genomics, where $k$-mer count tables can contain billions of entries dominated by a single frequent value. While recent work has proposed to address this problem by augmenting compressed static functions (CSFs) with pre-filters, existing approaches rely on complex heuristics and lack formal guarantees. In this paper, we introduce a principled algorithm, called AutoCSF, for combining CSFs with pre-filtering to provably handle skewed distributions with near-optimal space usage. We improve upon prior CSF pre-filtering constructions by (1) deriving a mathematically rigorous decision criterion for when filter augmentation is beneficial; (2) presenting a general…
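
To make the idea of pre-filtering concrete, here is a minimal Python sketch of a filter-augmented lookup structure. It is not the AutoCSF algorithm: the abstract does not give the construction or the decision criterion, so this only illustrates the general pattern, assuming a Bloom filter as the pre-filter and a plain dict as a stand-in for a real compressed static function. The names `BloomFilter`, `FilteredIndex`, and `fp_rate` are hypothetical and chosen for this example.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter sized with the standard textbook formulas."""

    def __init__(self, n_items, fp_rate):
        n_items = max(1, n_items)
        # m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hash functions
        self.m = max(8, math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Double hashing derived from one SHA-256 digest.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(key))


class FilteredIndex:
    """Pre-filter plus fallback map for a table with one dominant value.

    Assumption: the dict `fallback` stands in for a compressed static
    function; a real CSF would store these values far more compactly.
    """

    def __init__(self, table, fp_rate=0.01):
        counts = {}
        for value in table.values():
            counts[value] = counts.get(value, 0) + 1
        self.dominant = max(counts, key=counts.get)

        # The filter holds only keys whose value is NOT the dominant one.
        minority = [k for k, v in table.items() if v != self.dominant]
        self.filter = BloomFilter(len(minority), fp_rate)
        for key in minority:
            self.filter.add(key)

        # Every key that passes the filter needs an explicit value,
        # including dominant-value keys that are false positives.
        self.fallback = {k: v for k, v in table.items() if k in self.filter}

    def get(self, key):
        # Like a static function, this is only defined for keys in the table.
        if key not in self.filter:
            return self.dominant
        return self.fallback[key]


# Toy skewed table: 950 keys map to 1, 50 keys map to other values.
table = {f"kmer{i}": 1 for i in range(1000)}
table.update({f"kmer{i}": 2 + i % 3 for i in range(0, 1000, 20)})
index = FilteredIndex(table)
```

Because a Bloom filter has no false negatives, any key rejected by the filter must carry the dominant value, so only the minority keys and the filter's false positives need explicit storage. Whether the filter is worth its own space cost depends on the skew, which is the kind of trade-off the paper's decision criterion is stated to address.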

Taxonomy

Topics: Data Quality and Management · Data Management and Algorithms · Algorithms and Data Compression