Instance-Optimized String Fingerprints

Mihail Stoian; Johannes Thürauf; Andreas Zimmerer; Alexander van Renen; Andreas Kipf

arXiv:2507.10391·cs.DB·July 15, 2025

TL;DR

This paper optimizes string fingerprints, a lightweight secondary index for approximating LIKE predicates, to speed up string query processing in cloud data warehouses, reaching table-scan speedups of up to 1.36× on an IMDb column in DuckDB v1.3.

Contribution

The paper shows that string fingerprints can be optimized for a specific workload using mixed-integer optimization, and that the optimized fingerprints generalize to table predicates not seen during optimization.

Findings

1. Up to 1.36× speedup in table scans on IMDb data

2. Optimized fingerprints reduce compute and I/O overhead

3. Method generalizes to unseen table predicates

Abstract

Recent research found that cloud data warehouses are text-heavy. However, their capabilities for efficiently processing string columns remain limited, relying primarily on techniques like dictionary encoding and prefix-based partition pruning. In recent work, we introduced string fingerprints, a lightweight secondary index structure designed to approximate LIKE predicates, albeit with false positives. This approach is particularly compelling for columnar query engines, where fingerprints can help reduce both compute and I/O overhead. We show that string fingerprints can be optimized for specific workloads using mixed-integer optimization, and that they can generalize to unseen table predicates. On an IMDb column evaluated in DuckDB v1.3, this yields table-scan speedups of up to 1.36×.
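
To make the idea of approximating a LIKE predicate "with false positives" concrete, here is a minimal sketch. It assumes a fingerprint is a small bitmask with one bit per character bin, set when the string contains any character from that bin; the abstract does not spell out the layout, so this is an illustration rather than the authors' implementation. The helper names (`make_bins`, `fingerprint`, `may_match`, `scan_like`) and the round-robin bin assignment are hypothetical. The round-robin assignment stands in for the partition that the paper obtains through mixed-integer optimization.

```python
from typing import Dict, List, Tuple


def make_bins(alphabet: str, num_bins: int) -> Dict[str, int]:
    """Assign characters to bins round-robin (a naive, unoptimized partition)."""
    return {ch: i % num_bins for i, ch in enumerate(alphabet)}


def fingerprint(text: str, bins: Dict[str, int], num_bins: int) -> int:
    """Set one bit per bin that has at least one character present in text."""
    mask = 0
    for ch in text:
        # Assumption: characters outside the alphabet share the last bin.
        mask |= 1 << bins.get(ch, num_bins - 1)
    return mask


def may_match(row_fp: int, pattern_fp: int) -> bool:
    """A row can contain the pattern only if it covers all pattern bits."""
    return (row_fp & pattern_fp) == pattern_fp


def scan_like(
    rows: List[str], needle: str, bins: Dict[str, int], num_bins: int
) -> Tuple[List[str], int]:
    """Evaluate LIKE '%needle%' with a fingerprint pre-filter.

    Returns the true matches and the number of rows that survived the filter.
    """
    # In a real engine the row fingerprints would be precomputed and stored.
    row_fps = [fingerprint(r, bins, num_bins) for r in rows]
    needle_fp = fingerprint(needle, bins, num_bins)
    candidates = [r for r, fp in zip(rows, row_fps) if may_match(fp, needle_fp)]
    # The exact substring check removes the false positives.
    return [r for r in candidates if needle in r], len(candidates)


if __name__ == "__main__":
    bins = make_bins("abcdefghijklmnopqrstuvwxyz", 8)
    matches, num_candidates = scan_like(["alien", "heat", "up", "jaws"], "ie", bins, 8)
    print(matches, num_candidates)
```

In this toy run the filter skips "up" and "jaws" without a string comparison, while "heat" passes the filter even though it does not contain "ie": that is a false positive, which the exact check then discards. Because every character of a substring also occurs in the enclosing string, the filter never drops a true match. How many false positives remain depends on how characters are grouped into bins, which is the degree of freedom a workload-specific optimization can tune.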

Taxonomy

Topics: Biometric Identification and Security · Handwritten Text Recognition Techniques · Forensic Fingerprint Detection Methods