Cache-based Multi-query Optimization for Data-intensive Scalable Computing Frameworks
Pietro Michiardi, Damiano Carra, Sara Migliorini

TL;DR
This paper presents a cache-based multi-query optimization technique for large-scale distributed systems. It shares common subexpressions among queries through in-memory caching, which avoids redundant processing and saves cluster resources.
Contribution
The paper introduces a method that combines in-memory cache primitives with multi-query optimization, formulated as a cost-based optimization problem under memory constraints, to improve the efficiency of data-intensive scalable computing frameworks.
Findings
Significant resource savings on TPC-DS workloads
Effective sharing of common subexpressions reduces computation
Prototype shows notable performance improvements
Abstract
In modern large-scale distributed systems, analytics jobs submitted by various users often share similar work, for example scanning and processing the same subset of data. Instead of optimizing jobs independently, which may result in redundant and wasteful processing, multi-query optimization techniques can be employed to save a considerable amount of cluster resources. In this work, we introduce a novel method combining in-memory cache primitives and multi-query optimization, to improve the efficiency of data-intensive, scalable computing frameworks. By careful selection and exploitation of common (sub)expressions, while satisfying memory constraints, our method transforms a batch of queries into a new, more efficient one which avoids unnecessary recomputations. To find feasible and efficient execution plans, our method uses a cost-based optimization formulation akin to the…
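The selection step described in the abstract, choosing which common (sub)expressions to keep in memory without exceeding a memory budget, can be illustrated with a minimal sketch. It assumes a plain 0/1 knapsack over hypothetical candidates; this is a stand-in for illustration and not necessarily the formulation the paper uses, since the abstract is cut off before naming it. The candidate names, savings, and sizes are invented.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    name: str    # hypothetical label for a shared subexpression
    saving: int  # assumed estimate of cost saved by caching it (arbitrary units)
    size: int    # assumed estimate of its memory footprint (e.g. MB)


def select_to_cache(cands: List[Candidate], budget: int) -> Tuple[int, List[str]]:
    """Pick the subset of candidates with maximum total saving within the budget.

    Plain 0/1 knapsack dynamic program (an illustrative assumption).
    """
    # best[m] = (total saving, chosen names) using at most m memory units
    best: List[Tuple[int, List[str]]] = [(0, [])] * (budget + 1)
    for c in cands:
        # Iterate memory downward so each candidate is chosen at most once.
        for m in range(budget, c.size - 1, -1):
            saving_with_c = best[m - c.size][0] + c.saving
            if saving_with_c > best[m][0]:
                best[m] = (saving_with_c, best[m - c.size][1] + [c.name])
    return best[budget]


# Hypothetical batch: three shared subexpressions competing for 50 MB of cache.
candidates = [
    Candidate("scan_store_sales", saving=60, size=10),
    Candidate("join_sales_date", saving=100, size=20),
    Candidate("agg_by_item", saving=120, size=30),
]
total_saving, chosen = select_to_cache(candidates, budget=50)
print(total_saving, chosen)
```

In this toy input the two larger candidates fit the budget together and are preferred over caching the cheapest scan, which shows why selection has to weigh estimated saving against memory footprint rather than cache every shared subexpression.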
