Learning-Augmented Search Data Structures
Chunkai Fu, Brandon G. Nguyen, Jung Hoon Seo, Ryan Zesch, Samson Zhou

TL;DR
This paper introduces learning-augmented skip lists and KD trees that leverage machine learning advice to optimize search times, demonstrating theoretical optimality and empirical superiority over traditional data structures even with imperfect predictions.
Contribution
It presents novel learning-augmented search data structures based on skip lists and KD trees that are provably near-optimal and robust to prediction errors.
Findings
Achieve near-optimal expected search time with potentially erroneous advice.
Maintain robustness with constant-factor approximation even with arbitrary prediction errors.
Outperform traditional data structures on synthetic and real-world datasets.
Abstract
We study the integration of machine learning advice to improve upon traditional data structures designed for efficient search queries. Although there has been recent effort in improving the performance of binary search trees using machine learning advice, e.g., Lin et al. (ICML 2022), the resulting constructions nevertheless suffer from inherent weaknesses of binary search trees, such as the complexity of maintaining balance across multiple updates and the inability to handle partially-ordered or high-dimensional datasets. For these reasons, we focus on skip lists and KD trees in this work. Given access to a possibly erroneous oracle that outputs estimated fractional frequencies for search queries on a set of items, we construct skip lists and KD trees that provably provide the optimal expected search time, within nearly a factor of two. In fact, our learning-augmented skip lists and KD…
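The frequency-aware skip list idea can be illustrated with a small Python sketch. This is not the authors' construction: the class names (`Node`, `FrequencySkipList`) and the promotion rule (promote with certainty while 2^h <= n * p, otherwise flip a fair coin) are assumptions chosen for illustration. Under uniform advice the rule reduces to an ordinary randomized skip list.

```python
import random


class Node:
    """One skip-list node; forward[i] is the next node on level i."""

    def __init__(self, key, height):
        self.key = key
        self.forward = [None] * height


class FrequencySkipList:
    """Skip list whose node heights are biased by predicted frequencies."""

    def __init__(self, predicted, seed=0, max_height=32):
        # predicted: dict mapping key -> predicted fractional frequency (the advice).
        self.rng = random.Random(seed)
        self.max_height = max_height
        self.n = len(predicted)
        self.head = Node(None, max_height)
        self.height = 1
        for key, p in predicted.items():
            self._insert(key, p)

    def _draw_height(self, p):
        # ASSUMED rule (illustrative only): always promote while 2^h <= n * p,
        # otherwise promote with probability 1/2 as in a classical skip list.
        h = 1
        while h < self.max_height:
            if 2 ** h > self.n * p and self.rng.random() >= 0.5:
                break
            h += 1
        return h

    def _predecessors(self, key):
        """Return (last node with a smaller key on each level, pointer hops taken)."""
        update = [self.head] * self.max_height
        node, hops = self.head, 0
        for level in range(self.height - 1, -1, -1):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
                hops += 1
            update[level] = node
        return update, hops

    def _insert(self, key, p):
        update, _ = self._predecessors(key)
        h = self._draw_height(p)
        self.height = max(self.height, h)
        new = Node(key, h)
        for level in range(h):
            new.forward[level] = update[level].forward[level]
            update[level].forward[level] = new

    def search(self, key):
        """Return (found, hops), where hops counts forward-pointer moves."""
        update, hops = self._predecessors(key)
        candidate = update[0].forward[0]
        return (candidate is not None and candidate.key == key), hops


# Toy advice: key 500 receives half of all queries; the rest share the other half.
advice = {k: 0.5 / 999 for k in range(1000)}
advice[500] = 0.5
sl = FrequencySkipList(advice, seed=1)
found, hops = sl.search(500)
```

With this toy advice, key 500 is forced up to at least nine levels, so a search reaches it near the top of the structure, while low-frequency keys keep the usual geometric height distribution.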
Peer Reviews
Decision·ICLR 2025 Poster
The part on skip lists is a complete and meaningful contribution. The idea of promoting higher frequency elements to higher levels with more probability is both natural and easy to implement. Optimality as well as robustness of the proposed data structure is shown, and it is also shown to be superior in practice with both perfect and erroneous oracles.
The part on kd trees felt rushed and left me confused about both the motivation and the details of the setting. Motivation: In the paper, kd trees are considered for doing lookups for high dimensional points (as opposed to nearest neighbor search). But if one is just interested in lookups, it is unclear to me why we need kd trees. Why not just use something like the proposed skip lists after labeling the points from 1 to n? If we want to also support fast membership queries for frequent items n
1. The integration of learned frequencies into skip lists and KD trees is well-constructed and achieves optimal performance.
2. Experiments show that the proposed algorithm outperforms classical algorithms and is robust to noise.
3. The paper is clearly written and easy to read. The ideas are simple, yet novel and effective. I appreciate the clear comparison
1. The authors slightly overstate their contributions. For example, the claim of constant expected search time under the Zipfian distribution holds only when the exponent $s>1$, because the entropy is then of constant order (Lemma 2.3). In other words, any data structure that achieves optimality has the same property. Additionally, the robustness claim in the abstract for arbitrarily incorrect predictions is not what you describe later.
2. The noise robustness measure, den
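The reviewer's remark about the Zipfian exponent can be checked numerically. The sketch below computes the Shannon entropy of a Zipfian distribution over n items; `zipf_entropy` is a hypothetical helper written for this illustration, not code from the paper. For $s>1$ the entropy stays essentially flat as n grows, whereas for $s=1$ it keeps growing with n.

```python
import math


def zipf_entropy(n, s):
    """Shannon entropy (in bits) of the Zipfian distribution p_k proportional to k^(-s), k = 1..n."""
    weights = [k ** -s for k in range(1, n + 1)]
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights)


# Compare how the entropy changes when n grows from 10^3 to 10^5.
growth_s2 = zipf_entropy(10**5, 2.0) - zipf_entropy(10**3, 2.0)
growth_s1 = zipf_entropy(10**5, 1.0) - zipf_entropy(10**3, 1.0)
print(f"s=2: entropy grows by {growth_s2:.4f} bits")
print(f"s=1: entropy grows by {growth_s1:.4f} bits")
```

Since an entropy-optimal search structure has expected search time on the order of the entropy, this is consistent with the reviewer's point that constant search time for $s>1$ is a property of the distribution rather than of one particular data structure.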
- The proposed approach is an elegant way to improve search data structure performance by considering the probabilistic nature of the data.
- Given the abundance of machine learning and statistical methods that can serve as an oracle, the proposed approach can be widely used in practice.
- New bounds are proven theoretically, and the experimental results partially support the theoretical findings.
- It is not fully clear from the paper how to use the proposed method with real machine learning oracle models (that non-trivially predict frequencies) instead of statistical oracles (that just calculate a table of frequencies).
- Some graphs, such as Figure 2, show a constant-factor speedup. It would be nice to clarify what speedup the theory predicts for the various datasets and parameters, and how theory aligns with practice.
Taxonomy
Topics: Handwritten Text Recognition Techniques
Methods: Sparse Evolutionary Training
