Metadata Caching in Presto: Towards Fast Data Processing

Beinan Wang; Chunxu Tang; Rongrong Zhong; Bin Fan; Yi Wang; Jasmine Wang; Shouwei Chen; Bowen Ding; Lu Zhang

arXiv:2211.10889·cs.DB·November 22, 2022

TL;DR

This paper introduces a metadata caching layer for Presto worker nodes that cuts the CPU spent parsing column-oriented data files by 10-40% on TPC-DS, depending on the caching method.

Contribution

It proposes a metadata caching layer, built on top of the Alluxio SDK cache and incorporated in each Presto worker node, that caches the intermediate results of file parsing to reduce CPU consumption.

Findings

01. Caching decompressed metadata bytes reduces query CPU consumption by 10-20%.

02. Caching deserialized metadata objects reduces query CPU consumption by 20-40%.

03. Both reductions were measured on the TPC-DS benchmark running on Presto.

Abstract

Presto is an open-source distributed SQL query engine for OLAP, aiming for "SQL on everything". Since being open-sourced in 2013, Presto has been consistently gaining popularity in large-scale data analytics and attracting adoption from a wide range of enterprises. From the development and operation of Presto, we witnessed a significant amount of CPU consumption on parsing column-oriented data files in Presto worker nodes. This blocks some companies, including Meta, from increasing analytical data volumes. In this paper, we present a metadata caching layer, built on top of the Alluxio SDK cache and incorporated in each Presto worker node, to cache the intermediate results in file parsing. The metadata cache provides two caching methods: caching the decompressed metadata bytes from raw data files and caching the deserialized metadata objects. Our evaluation of the TPC-DS benchmark on Presto…
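The two caching methods named in the abstract can be illustrated with a small sketch. This is not the paper's implementation and does not use the Alluxio SDK: the `LRUCache` and `MetadataReader` classes are hypothetical, and `zlib` plus `json` merely stand in for the compression and serialization formats of real column-oriented file metadata. The call counters show which parsing steps each method avoids on repeated reads.

```python
import json
import zlib
from collections import OrderedDict


class LRUCache:
    """Tiny in-memory LRU map; a hypothetical stand-in for a real cache store."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


class MetadataReader:
    """Reads file metadata with one of three modes (all names are illustrative):
    "none"    - decompress and deserialize on every read
    "bytes"   - cache the decompressed metadata bytes
    "objects" - cache the deserialized metadata object
    """

    def __init__(self, mode, capacity=128):
        assert mode in ("none", "bytes", "objects")
        self.mode = mode
        self.cache = LRUCache(capacity)
        self.decompress_calls = 0
        self.deserialize_calls = 0

    def _decompress(self, raw):
        self.decompress_calls += 1
        return zlib.decompress(raw)

    def _deserialize(self, data):
        self.deserialize_calls += 1
        return json.loads(data)

    def read(self, file_id, raw):
        if self.mode == "objects":
            obj = self.cache.get(file_id)
            if obj is None:  # miss: pay for both steps once
                obj = self._deserialize(self._decompress(raw))
                self.cache.put(file_id, obj)
            return obj
        if self.mode == "bytes":
            data = self.cache.get(file_id)
            if data is None:  # miss: pay for decompression once
                data = self._decompress(raw)
                self.cache.put(file_id, data)
            return self._deserialize(data)  # still deserialized on every read
        return self._deserialize(self._decompress(raw))


# Assumed toy metadata; real footers are far larger and use other formats.
footer = zlib.compress(json.dumps({"columns": ["a", "b"], "rows": 1000}).encode())
for mode in ("none", "bytes", "objects"):
    reader = MetadataReader(mode)
    for _ in range(3):
        reader.read("part-0", footer)
    print(mode, reader.decompress_calls, reader.deserialize_calls)
# prints:
# none 3 3
# bytes 1 3
# objects 1 1
```

In this sketch, caching bytes skips only repeated decompression, while caching objects skips both decompression and deserialization. That ordering is consistent with the larger CPU reduction reported for the second method, although the sketch says nothing about the memory cost of holding deserialized objects.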

Taxonomy

Topics: Cloud Computing and Resource Management · Advanced Database Systems and Queries · Data Management and Algorithms