Exploiting Data Skew for Improved Query Performance
Wangda Zhang, Kenneth A. Ross

TL;DR
This paper introduces an index structure that exploits data skew by placing popular data items in the same cache lines, which improves cache utilization and speeds up analytic queries over large datasets.
Contribution
It proposes a novel index design that concentrates popular data items into the same cache lines, optimizing spatial locality and cache efficiency.
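The effect of concentrating popular items can be illustrated with a small sketch. This is not the paper's index structure; it is a toy model under stated assumptions: items live in a flat array, eight items fit in one cache line (`ITEMS_PER_LINE` is a hypothetical value), popularity follows a Zipf-like 1/rank weighting, and "repositioning" is simply sorting by popularity. The helper names `lines_touched`, `build_layouts`, and `demo` are invented for illustration. The sketch counts how many distinct cache lines must stay resident to hold the hottest items before and after repositioning.

```python
import random

ITEMS_PER_LINE = 8  # assumption: e.g. a 64-byte cache line holding eight 8-byte items


def lines_touched(positions, hot_items):
    """Count the distinct cache lines that hold the given hot items."""
    return len({positions[item] // ITEMS_PER_LINE for item in hot_items})


def build_layouts(freqs):
    """Return (original, repositioned) maps from item id to array slot."""
    n = len(freqs)
    # Original layout: items stored in key order, regardless of popularity.
    original = {item: item for item in range(n)}
    # Repositioned layout (illustrative only): most popular items first,
    # so that hot items share cache lines.
    by_popularity = sorted(range(n), key=lambda item: -freqs[item])
    repositioned = {item: slot for slot, item in enumerate(by_popularity)}
    return original, repositioned


def demo(n=1024, hot=64, seed=0):
    """Return (lines needed in original layout, lines needed after repositioning)."""
    rng = random.Random(seed)
    # Zipf-like weights 1/rank, assigned to shuffled item ids so that
    # popular items are scattered across the key space.
    ids = list(range(n))
    rng.shuffle(ids)
    freqs = [0.0] * n
    for rank, item in enumerate(ids, start=1):
        freqs[item] = 1.0 / rank
    hot_items = ids[:hot]  # the `hot` most popular items
    original, repositioned = build_layouts(freqs)
    return (lines_touched(original, hot_items),
            lines_touched(repositioned, hot_items))


if __name__ == "__main__":
    scattered, packed = demo()
    print(f"cache lines holding the hot items: original={scattered}, repositioned={packed}")
```

In this toy model the 64 hottest items fit in exactly 64 / 8 = 8 cache lines after repositioning, whereas in the key-ordered layout they are spread over many more lines, each of which is mostly occupied by unpopular items.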
Findings
Query performance improves significantly, in some cases by more than a factor of ten.
Repositioning data items makes more effective use of cache resources.
A theoretical model is provided for analyzing cache behavior with skewed data.
Abstract
Analytic queries enable sophisticated large-scale data analysis within many commercial, scientific and medical domains today. Data skew is a ubiquitous feature of these real-world domains. In a retail database, some products are typically much more popular than others. In a text database, word frequencies follow a Zipf distribution with a small number of very common words, and a long tail of infrequent words. In a geographic database, some regions have much higher populations (and data measurements) than others. Current systems do not make the most of caches for exploiting skew. In particular, a whole cache line may remain cache resident even though only a small part of the cache line corresponds to a popular data item. In this paper, we propose a novel index structure for repositioning data items to concentrate popular items into the same cache lines. The net result is better spatial…
