Chipmink: Efficient Delta Identification for Massive Object Graph
Supawit Chockchowwat, Sumay Thakurdesai, Zhaoheng Li, Matthew Krafczyk, Yongjoo Park

TL;DR
Chipmink introduces a graph-based object store that efficiently identifies and persists only modified objects in massive, evolving object graphs, significantly reducing storage and time costs in data science workflows.
Contribution
The paper presents Chipmink, a dynamic partitioning approach that isolates dirty objects for partial persistence, outperforming existing snapshotting methods in both persistence time and storage size.
Findings
Achieves up to 36.5x smaller storage size
Persists objects 12.4x faster
Supports diverse data science libraries across different hardware
Abstract
Ranging from batch scripts to computational notebooks, modern data science tools rely on massive and evolving object graphs that represent structured data, models, plots, and more. Persisting these objects is critical, not only to enhance system robustness against unexpected failures but also to support continuous, non-linear data exploration via versioning. Existing object persistence mechanisms (e.g., Pickle, Dill) rely on complete snapshotting, often redundantly storing unchanged objects during execution and exploration, resulting in significant inefficiency in both time and storage. Unlike DBMSs, data science systems lack centralized buffer managers that track dirty objects. Worse, object states span various locations such as memory heaps, shared memory, GPUs, and remote machines, making dirty object identification fundamentally more challenging. In this work, we propose a…
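To make the cost of complete snapshotting concrete, here is a minimal Python sketch of a store that writes only the top-level variables whose serialized bytes changed since an earlier save. It is not Chipmink's algorithm: the class name `NaiveDeltaStore`, its methods, and the hash-per-variable scheme are all hypothetical, chosen only to illustrate the idea of skipping unchanged objects. It assumes every object is picklable and lives in ordinary process memory.

```python
import hashlib
import pickle


class NaiveDeltaStore:
    """Hypothetical store: persists a variable only if its pickled bytes are new."""

    def __init__(self):
        self.blobs = {}     # content hash -> pickled bytes
        self.versions = []  # one {variable name: content hash} manifest per save

    def save(self, namespace: dict) -> int:
        """Record a new version and return the number of bytes newly written."""
        manifest = {}
        written = 0
        for name, obj in namespace.items():
            data = pickle.dumps(obj)
            key = hashlib.sha256(data).hexdigest()
            if key not in self.blobs:  # unchanged objects are skipped
                self.blobs[key] = data
                written += len(data)
            manifest[name] = key
        self.versions.append(manifest)
        return written

    def load(self, version: int) -> dict:
        """Rebuild the namespace as it was at the given version."""
        manifest = self.versions[version]
        return {name: pickle.loads(self.blobs[key]) for name, key in manifest.items()}


store = NaiveDeltaStore()
namespace = {"data": list(range(1000)), "label": "v1"}
first = store.save(namespace)   # writes both variables
namespace["label"] = "v2"
second = store.save(namespace)  # writes only the changed label
```

This sketch saves storage but not time, because it still serializes every object to compute its hash, and it loses references shared between variables. Those are the kinds of limitations that motivate identifying dirty objects without re-serializing the whole graph.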
Taxonomy
Topics: Graph Theory and Algorithms · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
