FaaS and Furious: abstractions and differential caching for efficient data pre-processing

Jacopo Tagliabue; Ryan Curtin; Ciro Greco

arXiv:2411.08203·cs.DB·November 14, 2024

Open Access

TL;DR

This paper presents a novel programming model and a differential caching system for data pre-processing in data lakehouses. The cache is managed transparently across programming languages, schemas, and time windows, which speeds up iteration for data scientists.

Contribution

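It introduces a new abstraction for data pipelines in lakehouses, combined with a columnar, differential cache that makes pre-processing more efficient across common usage patterns: adding or removing features, restricting or relaxing time windows, and working with current or older datasets.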

Findings

01. Cache improves iteration speed in data pre-processing tasks.

02. System works transparently across programming languages and schemas.

03. Preliminary results show efficiency gains on standard workloads.

Abstract

Data pre-processing pipelines are the bread and butter of any successful AI project. We introduce a novel programming model for pipelines in a data lakehouse, allowing users to interact declaratively with assets in object storage. Motivated by real-world industry usage patterns, we exploit these new abstractions with a columnar and differential cache to maximize iteration speed for data scientists, who spend most of their time in pre-processing: adding or removing features, restricting or relaxing time windows, wrangling current or older datasets. We show how the new cache works transparently across programming languages, schemas and time windows, and provide preliminary evidence on its efficiency on standard data workloads.
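To make the idea of a columnar, differential cache concrete, here is a minimal sketch in Python. It is not the authors' implementation, and the abstract does not describe their algorithm. The class `DifferentialCache`, its methods `missing` and `record`, and the use of integer timestamps with half-open intervals are all assumptions made for illustration. The sketch shows one plausible reading: for each column, the cache remembers which time intervals it already holds, so a new request only fetches the columns and time ranges that are not yet covered.

```python
from dataclasses import dataclass, field


@dataclass
class DifferentialCache:
    """Toy cache (hypothetical, for illustration only).

    For each column it tracks the half-open time intervals [start, end)
    that have already been fetched from object storage.
    """

    # column name -> sorted list of disjoint (start, end) intervals
    cached: dict = field(default_factory=dict)

    def missing(self, columns, start, end):
        """Return {column: [intervals]} still to fetch for [start, end)."""
        plan = {}
        for col in columns:
            gaps = []
            cursor = start  # everything before cursor is already covered
            for lo, hi in self.cached.get(col, []):
                if hi <= cursor:
                    continue  # interval lies entirely before the request
                if lo >= end:
                    break  # interval lies entirely after the request
                if lo > cursor:
                    gaps.append((cursor, lo))  # uncovered stretch
                cursor = max(cursor, hi)
                if cursor >= end:
                    break
            if cursor < end:
                gaps.append((cursor, end))
            if gaps:
                plan[col] = gaps
        return plan

    def record(self, columns, start, end):
        """Mark [start, end) as cached for each column, merging overlaps."""
        for col in columns:
            intervals = sorted(self.cached.get(col, []) + [(start, end)])
            merged = [intervals[0]]
            for lo, hi in intervals[1:]:
                last_lo, last_hi = merged[-1]
                if lo <= last_hi:  # overlapping or adjacent: extend
                    merged[-1] = (last_lo, max(last_hi, hi))
                else:
                    merged.append((lo, hi))
            self.cached[col] = merged


cache = DifferentialCache()
cache.record(["price", "qty"], 10, 20)

# Adding a feature ("region") and relaxing the time window to [10, 30):
plan = cache.missing(["price", "qty", "region"], 10, 30)
```

In this sketch, `plan` asks for only the new slice `(20, 30)` of `price` and `qty`, plus the full window `(10, 30)` for the newly added `region` column. A request for a narrower window over cached columns yields an empty plan, which corresponds to the "restricting time windows" pattern mentioned in the abstract.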

Taxonomy

Topics: Advanced Data Storage Technologies