Learned LSM-trees: Two Approaches Using Learned Bloom Filters
Nicholas Fidalgo, Puyuan Ye

TL;DR
This paper investigates integrating machine learning models into the LSM-tree lookup path through two approaches: a classifier that skips Bloom filter probes to cut read latency, and learned Bloom filters with small backup filters that cut memory usage without compromising correctness.
Contribution
The paper introduces two methods for embedding learned models into LSM-trees: one reduces read latency, and the other reduces memory footprint relative to traditional Bloom filters.
Findings
Classifier reduces GET latency by up to 2.28x
Learned Bloom filters with backup filters keep zero false negatives and cut memory by 70-80%
Trade-offs between latency, memory, and correctness are demonstrated
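The latency versus correctness trade-off in the first finding can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `get` function, the dictionary-based levels, the exact-set stand-in for a Bloom filter, and the rule-based `toy_classifier` are hypothetical and are not the paper's model, features, or storage engine. The sketch only shows that skipping probes saves work when the prediction is right and can miss a key when it is wrong.

```python
def get(key, levels, predict_relevant):
    """Search levels in order; return (value, number_of_filter_probes)."""
    probes = 0
    for index, level in enumerate(levels):
        # A level predicted irrelevant is skipped without probing its filter.
        if not predict_relevant(key, index):
            continue
        probes += 1
        # The set is an exact stand-in for a per-level Bloom filter probe.
        if key in level["filter"] and key in level["data"]:
            return level["data"][key], probes
    return None, probes


def make_level(data):
    """Build a hypothetical level: stored data plus its membership filter."""
    return {"filter": set(data), "data": data}


def always_probe(key, index):
    """Baseline behavior: probe the filter of every level."""
    return True


def toy_classifier(key, index):
    """Hypothetical rule: guess a single level from the key's magnitude."""
    predicted = 0 if key < 10 else 1 if key < 100 else 2
    return index == predicted


levels = [
    make_level({1: "a"}),
    make_level({50: "b"}),
    make_level({500: "c", 3: "d"}),  # key 3 sits where the toy rule does not look
]

baseline_hit = get(50, levels, always_probe)
gated_hit = get(50, levels, toy_classifier)
baseline_deep = get(3, levels, always_probe)
gated_miss = get(3, levels, toy_classifier)
```

For key 50 the gated lookup returns the same value with fewer probes than the baseline. For key 3 the toy rule guesses the wrong level, so the gated lookup returns `None` even though the key is stored, which is the kind of false negative a probe-skipping classifier can introduce.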
Abstract
Modern key-value stores rely heavily on Log-Structured Merge (LSM) trees for write optimization, but this design introduces significant read amplification. Auxiliary structures like Bloom filters help, but impose memory costs that scale with tree depth and dataset size. Recent advances in learned data structures suggest that machine learning models can augment or replace these components, trading handcrafted heuristics for data-adaptive behavior. In this work, we explore two approaches for integrating learned predictions into the LSM-tree lookup path. The first uses a classifier to selectively bypass Bloom filter probes for irrelevant levels, aiming to reduce average-case query latency. The second replaces traditional Bloom filters with compact learned models and small backup filters, targeting memory footprint reduction without compromising correctness. We implement both methods atop a…
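The second approach, a learned model paired with a small backup filter, can be sketched as follows. This is a minimal illustration of the general learned Bloom filter construction, not the authors' implementation: the class names, the SHA-256 double-hashing scheme, the filter sizes, the threshold, and the rule-based `toy_model` are all assumptions. The point it shows is why correctness is preserved: every stored key that the model would reject is inserted into the backup filter, so no stored key can be reported absent.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over integer keys using double hashing."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(str(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class LearnedBloomFilter:
    """Model score first, then a backup filter for keys the model rejects."""

    def __init__(self, keys, model, threshold, backup_bits=1024, backup_hashes=3):
        self.model = model
        self.threshold = threshold
        self.backup = BloomFilter(backup_bits, backup_hashes)
        # Stored keys scored below the threshold go into the backup filter,
        # which is what rules out false negatives.
        for key in keys:
            if model(key) < threshold:
                self.backup.add(key)

    def might_contain(self, key):
        if self.model(key) >= self.threshold:
            return True
        return self.backup.might_contain(key)


def toy_model(key):
    """Hypothetical stand-in model: keys in [1000, 2000) score as likely present."""
    return 0.9 if 1000 <= key < 2000 else 0.1


stored_keys = list(range(1000, 1100)) + [5, 42, 7777]
lbf = LearnedBloomFilter(stored_keys, toy_model, threshold=0.5)
false_negatives = [k for k in stored_keys if not lbf.might_contain(k)]
```

In this toy setup only the three keys outside the model's range are stored in the backup filter, which hints at where memory savings could come from when a model captures most of the key distribution. False positives remain possible from both the model and the backup filter.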
Taxonomy
Topics: Caching and Content Delivery · Advanced Data Storage Technologies · Cloud Computing and Resource Management
