In-Memory Indexed Caching for Distributed Data Processing
Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan Rellermeyer, and Peter Boncz

TL;DR
This paper presents the Indexed DataFrame, an in-memory caching extension for distributed data processing frameworks like Spark, enabling faster lookups and joins with minimal memory overhead, suitable for modern cloud workloads.
Contribution
Introduces the Indexed DataFrame, a lightweight, standalone library that adds indexing and appends with multi-version concurrency control to a cached dataframe, speeding up lookups and joins in distributed data processing.
Findings
Significantly faster query execution with Indexed DataFrame
Modest memory overhead observed in evaluations
Effective in cluster and cloud environments
Abstract
Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks…
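To make the abstract's three ingredients concrete (an index for fast lookups, index-based joins, and appends under multi-version concurrency control), here is a minimal single-process sketch. It is not the authors' implementation and not the library's API: the names `IndexedTable`, `append`, `lookup`, and `index_join` are hypothetical, the index is an ordinary Python dict, and the versioning scheme (one row-count watermark per append) is an assumption chosen for brevity. The real system runs across Spark partitions, which this toy ignores.

```python
from bisect import bisect_left
from typing import Any, Dict, List, Optional, Tuple

Row = Tuple[Any, ...]


class IndexedTable:
    """Toy in-memory table: append-only rows, a hash index, and versioned reads."""

    def __init__(self, key_col: int) -> None:
        self.key_col = key_col
        self.rows: List[Row] = []                # append-only row log
        self.index: Dict[Any, List[int]] = {}    # key -> ascending row positions
        self.watermarks: List[int] = [0]         # watermarks[v] = row count at version v

    @property
    def version(self) -> int:
        """Latest committed version (0 means the empty table)."""
        return len(self.watermarks) - 1

    def append(self, new_rows: List[Row]) -> int:
        """Append rows as one new version and return its version number."""
        for row in new_rows:
            self.index.setdefault(row[self.key_col], []).append(len(self.rows))
            self.rows.append(row)
        self.watermarks.append(len(self.rows))
        return self.version

    def lookup(self, key: Any, version: Optional[int] = None) -> List[Row]:
        """Return the rows with this key that are visible at the given version."""
        limit = self.watermarks[self.version if version is None else version]
        positions = self.index.get(key, [])
        # Positions are ascending, so everything before the watermark is visible.
        visible = bisect_left(positions, limit)
        return [self.rows[p] for p in positions[:visible]]


def index_join(probe_rows: List[Row], probe_key_col: int,
               table: IndexedTable, version: Optional[int] = None) -> List[Row]:
    """Join by probing the index once per probe row instead of scanning the table."""
    return [row + match
            for row in probe_rows
            for match in table.lookup(row[probe_key_col], version)]


if __name__ == "__main__":
    users = IndexedTable(key_col=0)
    v1 = users.append([(1, "alice"), (2, "bob")])
    users.append([(1, "alicia")])
    print(users.lookup(1, version=v1))   # reader pinned to the first version
    print(users.lookup(1))               # reader at the latest version
    print(index_join([(1, "click")], 0, users))
```

A reader that pins `version=v1` keeps seeing the same rows while later appends land, which is the property multi-version concurrency control is meant to provide; the join touches only the index entries for the probed keys rather than every cached row.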
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Database Systems and Queries · Scientific Computing and Data Management
