SSCard: Substring Cardinality Estimation using Suffix Tree-Guided Learned FM-Index

Yirui Zhan; Wen Nie; Jun Gao

arXiv:2505.24312 · cs.DB · June 2, 2025

TL;DR

SSCard estimates the cardinality of substring (SQL LIKE) queries with a suffix tree-guided learned FM-Index, improving estimation accuracy and construction time for database query optimization.

Contribution

It extends the FM-Index with a suffix tree structure and error-bounded spline interpolation, providing a space-efficient, accurate, and update-friendly cardinality estimator.
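To make "error-bounded interpolation" concrete, here is a minimal sketch of one generic way to approximate a monotone cumulative count with a few knots while keeping the linear-interpolation error at every sample within a bound `eps`. This page does not say which function SSCard interpolates or how it picks knots, so the greedy strategy and the names `fit_knots`, `predict`, and `eps` are illustrative assumptions, not the authors' method.

```python
from bisect import bisect_right

def _segment_ok(xs, ys, i, j, eps):
    """True if the line from point i to point j is within eps of every sample between them."""
    slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
    return all(abs(ys[i] + slope * (xs[k] - xs[i]) - ys[k]) <= eps
               for k in range(i + 1, j))

def fit_knots(xs, ys, eps):
    """Greedily choose knot indices (assumes >= 2 samples, xs strictly increasing)."""
    knots, start, n = [0], 0, len(xs)
    while start < n - 1:
        end = start + 1  # adjacent points always form a valid segment
        while end + 1 < n and _segment_ok(xs, ys, start, end + 1, eps):
            end += 1
        knots.append(end)
        start = end
    return knots

def predict(xs, ys, knots, x):
    """Linear interpolation between the two knots surrounding x."""
    kx = [xs[k] for k in knots]
    j = min(max(bisect_right(kx, x) - 1, 0), len(knots) - 2)
    a, b = knots[j], knots[j + 1]
    slope = (ys[b] - ys[a]) / (xs[b] - xs[a])
    return ys[a] + slope * (x - xs[a])

# Hypothetical data: a convex, monotone cumulative count.
xs = list(range(20))
ys = [i * i // 4 for i in range(20)]
knots = fit_knots(xs, ys, eps=2)
worst = max(abs(predict(xs, ys, knots, x) - y) for x, y in zip(xs, ys))
```

The greedy loop stops extending a segment at the first violation, so it does not guarantee the fewest possible knots, but every stored segment satisfies the bound by construction.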

Findings

01 Reduces average q-error by 20%

02 Achieves 80% reduction in maximum q-error

03 Cuts construction time by 50%
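For readers unfamiliar with the metric, q-error is commonly defined as max(estimate / truth, truth / estimate), so 1.0 is a perfect estimate. The sketch below computes the average and maximum over a handful of made-up (estimate, truth) pairs; clamping both values to a floor of 1 is a common convention assumed here, and the numbers are illustrative only, not results from the paper.

```python
def q_error(estimate: float, truth: float, floor: float = 1.0) -> float:
    """q-error = max(est/true, true/est), with both values clamped to `floor`
    so that zero cardinalities do not cause a division by zero (assumed convention)."""
    est = max(estimate, floor)
    tru = max(truth, floor)
    return max(est / tru, tru / est)

# Hypothetical (estimate, truth) pairs, for illustration only.
pairs = [(10, 10), (5, 20), (300, 100)]
errors = [q_error(e, t) for e, t in pairs]
avg_q = sum(errors) / len(errors)
max_q = max(errors)
```

Because the metric is a ratio, underestimates and overestimates by the same factor are penalised equally, and a single badly misestimated query dominates the maximum.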

Abstract

Accurate cardinality estimation of substring queries, which are commonly expressed using the SQL LIKE predicate, is crucial for query optimization in database systems. While both rule-based and machine learning-based methods have been developed to optimize various aspects of cardinality estimation, their lack of error bounds may result in substantial estimation errors, leading to suboptimal execution plans. In this paper, we propose SSCard, a novel SubString Cardinality estimator that adapts the space-efficient FM-Index to flexible database applications. SSCard first extends the FM-Index to support multiple strings naturally, and then organizes the FM-Index using a pruned suffix tree. The suffix tree structure enables precise cardinality estimation for short patterns and achieves high compression via a pushup operation, especially on a large alphabet with skewed character…
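As background for the abstract, the following is a minimal sketch of a plain (non-learned) FM-Index built over several strings, counting substring occurrences by backward search. It is not SSCard itself: there is no pruned suffix tree, pushup operation, or learned component, and the separator characters, the naive suffix sort, and the names `build_fm_index` and `count_occurrences` are assumptions made for illustration.

```python
def build_fm_index(strings):
    """Build a toy FM-Index over a collection of strings.

    Assumes the strings contain neither "\x00" nor "\x01". Each string is
    followed by the separator "\x01", and a unique terminator "\x00" ends the text.
    """
    text = "".join(s + "\x01" for s in strings) + "\x00"
    # Naive suffix array: fine for a demo, far too slow for real data.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = [text[i - 1] for i in sa]
    alphabet = sorted(set(text))
    # C[c] = number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(text)

def count_occurrences(index, pattern):
    """Exact number of occurrences of `pattern` across all indexed strings."""
    C, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

index = build_fm_index(["banana", "bandana", "cabana"])
ana = count_occurrences(index, "ana")
ban = count_occurrences(index, "ban")
```

Because a pattern never contains the separator, matches cannot span two strings. Note that this counts occurrences, not rows: "ana" occurs twice in "banana", so the count can exceed the number of strings a LIKE '%ana%' predicate would match.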

Code & Models

Repositories

marlcplhra/sscard (PyTorch, official)