Instance-Optimized String Fingerprints
Mihail Stoian, Johannes Thürauf, Andreas Zimmerer, Alexander van Renen, Andreas Kipf

TL;DR
This paper introduces workload-optimized string fingerprints as a lightweight indexing method for string query processing in cloud data warehouses, achieving table-scan speedups of up to 1.36× on IMDb data.
Contribution
It shows how to optimize string fingerprints for a specific workload using mixed-integer optimization, and that the optimized fingerprints generalize to unseen table predicates.
Findings
Up to 1.36× speedup in table scans on IMDb data
Optimized fingerprints reduce compute and I/O overhead
Method generalizes to unseen table predicates
Abstract
Recent research found that cloud data warehouses are text-heavy. However, their capabilities for efficiently processing string columns remain limited, relying primarily on techniques like dictionary encoding and prefix-based partition pruning. In recent work, we introduced string fingerprints, a lightweight secondary index structure designed to approximate LIKE predicates, albeit with false positives. This approach is particularly compelling for columnar query engines, where fingerprints can help reduce both compute and I/O overhead. We show that string fingerprints can be optimized for specific workloads using mixed-integer optimization, and that they can generalize to unseen table predicates. On an IMDb column evaluated in DuckDB v1.3, this yields table-scan speedups of up to 1.36×.
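To make the idea of approximating LIKE predicates with false positives concrete, the following is a minimal sketch and not the authors' implementation. It assumes a fingerprint is a small bitmask with one bit per character bin, set when the string contains any character from that bin. The round-robin bin assignment, the 8-bin width, and the helper names (make_bins, fingerprint, may_match, scan_contains) are all hypothetical choices made for illustration.

```python
import string


def make_bins(num_bins=8, alphabet=string.ascii_lowercase):
    """Hypothetical round-robin assignment of characters to bins (an assumption)."""
    return {ch: i % num_bins for i, ch in enumerate(alphabet)}


def fingerprint(text, bins):
    """Bitmask with bit b set if `text` contains any character mapped to bin b."""
    fp = 0
    for ch in text.lower():
        b = bins.get(ch)
        if b is not None:  # characters outside the alphabet are ignored
            fp |= 1 << b
    return fp


def may_match(row_fp, pattern_fp):
    """True if every bin required by the pattern is also present in the row."""
    return (row_fp & pattern_fp) == pattern_fp


def scan_contains(rows, needle, bins):
    """Emulate LIKE '%needle%': filter by fingerprint, then verify exactly."""
    needle_fp = fingerprint(needle, bins)
    # In a real engine the row fingerprints would be precomputed and stored.
    row_fps = [fingerprint(row, bins) for row in rows]
    hits, checked = [], 0
    for row, fp in zip(rows, row_fps):
        if not may_match(fp, needle_fp):
            continue  # safe to skip: the row cannot contain the needle
        checked += 1  # candidate row: may still be a false positive
        if needle.lower() in row.lower():
            hits.append(row)
    return hits, checked


bins = make_bins()
rows = ["The Matrix", "Up", "Heat"]
hits, checked = scan_contains(rows, "mat", bins)
print(hits, checked)
```

In this toy run, "Up" is skipped without a string comparison, while "Heat" passes the fingerprint test and is rejected only by the exact check, which is the false-positive case. Under these assumptions, the character-to-bin mapping is the part a workload-aware optimizer would tune; the round-robin mapping here is only a placeholder.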
Taxonomy
Topics: Biometric Identification and Security · Handwritten Text Recognition Techniques · Forensic Fingerprint Detection Methods
