Compression and In-Situ Query Processing for Fine-Grained Array Lineage
Jinjin Zhao, Sanjay Krishnan

TL;DR
This paper presents DSLog, a system for efficient storage, compression, and in-situ querying of fine-grained array data lineage, significantly reducing storage and improving query latency.
Contribution
Introduction of ProvRC, a novel compression algorithm that enables efficient storage and in-situ querying of array lineage data, outperforming alternative columnar-store baselines.
Findings
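ProvRC reduces storage by up to 2000x relative to columnar-store baselines on functions with simple spatial regularity.
ProvRC supports in-situ forward and backward lineage queries without decompression, with up to 20x lower query latency than baselines in the optimal case on random numpy pipelines.
DSLog stores, indexes, and queries array lineage independently of how the lineage was captured.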
Abstract
Tracking data lineage is important for data integrity, reproducibility, and debugging data science workflows. However, fine-grained lineage (i.e., at a cell level) is challenging to store, even for the smallest datasets. This paper introduces DSLog, a storage system that efficiently stores, indexes, and queries array data lineage, agnostic to capture methodology. A main contribution is our new compression algorithm, named ProvRC, that compresses captured lineage relationships. Using ProvRC for lineage compression results in a significant storage reduction over functions with simple spatial regularity, beating alternative columnar-store baselines by up to 2000x. We also show that ProvRC facilitates in-situ query processing that allows forward and backward lineage queries without decompression; in the optimal case, it surpasses baselines by 20x in query latency on random numpy pipelines.
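To make the storage problem concrete, the following is a minimal sketch of cell-level lineage for a single array operation and of how spatial regularity can be exploited. It is not ProvRC: the range encoding, the choice of a cumulative sum as the example operation, and all function names (`cell_lineage_cumsum`, `compress_to_ranges`, `backward_query`, `forward_query`) are assumptions made purely for illustration.

```python
def cell_lineage_cumsum(n):
    """Cell-level lineage of a length-n cumulative sum.

    Output cell i depends on input cells 0..i, so the result is a list
    of (output_index, input_index) pairs with n * (n + 1) / 2 entries.
    """
    return [(i, j) for i in range(n) for j in range(i + 1)]


def compress_to_ranges(pairs):
    """Collapse pairs into (output_index, input_start, input_end) runs.

    Illustrative range encoding only: consecutive input indices that
    feed the same output cell are merged into one inclusive range.
    """
    ranges = []
    for out_idx, in_idx in sorted(pairs):
        if ranges and ranges[-1][0] == out_idx and ranges[-1][2] + 1 == in_idx:
            ranges[-1] = (out_idx, ranges[-1][1], in_idx)
        else:
            ranges.append((out_idx, in_idx, in_idx))
    return ranges


def backward_query(ranges, out_idx):
    """Input ranges that contribute to one output cell, read off the ranges."""
    return [(start, end) for o, start, end in ranges if o == out_idx]


def forward_query(ranges, in_idx):
    """Output cells that depend on one input cell, read off the ranges."""
    return [o for o, start, end in ranges if start <= in_idx <= end]


pairs = cell_lineage_cumsum(4)
ranges = compress_to_ranges(pairs)
print(len(pairs), len(ranges))
print(backward_query(ranges, 2))
print(forward_query(ranges, 2))
```

In this toy case the number of raw pairs grows quadratically with the array length while the number of ranges grows linearly, and both query directions are answered on the ranges without expanding them back into pairs.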
Taxonomy
Topics: Antenna Design and Optimization
