Data Caching for Enterprise-Grade Petabyte-Scale OLAP
Chunxu Tang, Bin Fan, Jing Zhao, Chen Liang, Yi Wang, Beinan Wang, Ziyue Qiu, Lu Qiu, Bowen Ding, Shouzhuo Sun, Saiguang Che, Jiaming Mai, Shouwei Chen, Yu Zhu, Jianjian Xie, Yutian (James) Sun, Yao Li, Yangjun Zhang, Ke Wang, Mingmin Chen

TL;DR
This paper presents the Alluxio local cache, an architectural optimization for petabyte-scale OLAP systems that reduces network I/O and improves data transfer efficiency in large-scale enterprise environments.
Contribution
Introduction of the Alluxio local cache tailored for petabyte-scale OLAP, integrated with systems like Presto and HDFS, and validated through three years of deployment at Uber and Meta.
Findings
Significant reduction in network I/O and API call pressure.
Improved data transfer efficiency in large-scale workloads.
Effective handling of skewed and fragmented data access patterns.
Abstract
With the exponential growth of data and evolving use cases, petabyte-scale OLAP data platforms are increasingly adopting a model that decouples compute from storage. This shift, evident in organizations like Uber and Meta, introduces operational challenges including massive, read-heavy I/O traffic with potential throttling, as well as skewed and fragmented data access patterns. Addressing these challenges, this paper introduces the Alluxio local (edge) cache, a highly effective architectural optimization tailored for such environments. This embeddable cache, optimized for petabyte-scale data analytics, leverages local SSD resources to alleviate network I/O and API call pressures, significantly improving data transfer efficiency. Integrated with OLAP systems like Presto and storage services like HDFS, the Alluxio local cache has demonstrated its effectiveness in handling large-scale,…
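To make the idea of an embeddable, read-through local cache concrete, here is a minimal sketch. It is not the Alluxio implementation: the class name `LocalPageCache`, the tiny `PAGE_SIZE`, the LRU eviction policy, and the in-memory dictionary standing in for local SSD are all assumptions chosen for illustration. It only shows how serving repeated reads from locally cached pages can reduce the number of calls made to remote storage.

```python
from collections import OrderedDict

PAGE_SIZE = 4  # hypothetical, tiny page size so the example is easy to follow


class LocalPageCache:
    """Hypothetical read-through cache keyed by (path, page index), LRU-evicted.

    An in-memory OrderedDict stands in for local SSD storage (an assumption).
    """

    def __init__(self, remote_read, capacity_pages):
        self.remote_read = remote_read      # callable(path, offset, length) -> bytes
        self.capacity_pages = capacity_pages
        self.pages = OrderedDict()          # (path, page index) -> page bytes
        self.remote_calls = 0               # how often remote storage was hit

    def _get_page(self, path, index):
        key = (path, index)
        if key in self.pages:
            self.pages.move_to_end(key)     # mark as most recently used
            return self.pages[key]
        # Cache miss: fetch the whole page from remote storage and keep it.
        self.remote_calls += 1
        data = self.remote_read(path, index * PAGE_SIZE, PAGE_SIZE)
        self.pages[key] = data
        if len(self.pages) > self.capacity_pages:
            self.pages.popitem(last=False)  # evict the least recently used page
        return data

    def read(self, path, offset, length):
        if length <= 0:
            return b""
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        buf = b"".join(self._get_page(path, i) for i in range(first, last + 1))
        start = offset - first * PAGE_SIZE
        return buf[start:start + length]


# Stand-in for a remote store such as HDFS (purely illustrative).
REMOTE = {"f": b"abcdefghijkl"}


def remote_read(path, offset, length):
    return REMOTE[path][offset:offset + length]


cache = LocalPageCache(remote_read, capacity_pages=2)
first_read = cache.read("f", 2, 5)   # spans pages 0 and 1: two remote calls
second_read = cache.read("f", 0, 4)  # page 0 is already cached: no remote call
```

In this sketch the second read is served entirely from the cache, so `cache.remote_calls` stays at 2. A real deployment would persist pages on local SSD and handle concurrency, admission, and invalidation, none of which is modeled here.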
Taxonomy
Topics: Caching and Content Delivery · Advanced Data Storage Technologies · Cloud Computing and Resource Management
